HIERARCHICAL TESTING DESIGNS FOR PATTERN
RECOGNITION 

By Gilles Blanchard 1 and Donald Geman 2 

CNRS and Fraunhofer FIRST, and Johns Hopkins University 

We explore the theoretical foundations of a "twenty questions" 
approach to pattern recognition. The object of the analysis is the 
computational process itself rather than probability distributions (Bayesian 
inference) or decision boundaries (statistical learning). Our formu- 
lation is motivated by applications to scene interpretation in which 
there are a great many possible explanations for the data, one
("background") is statistically dominant, and it is imperative to restrict
intensive computation to genuinely ambiguous regions. 

The focus here is then on pattern filtering: Given a large set $\mathcal{Y}$ of
possible patterns or explanations, narrow down the true one $Y$ to a
small (random) subset $\hat{Y} \subset \mathcal{Y}$ of "detected" patterns to be subjected
to further, more intense, processing. To this end, we consider a family
of hypothesis tests for $Y \in A$ versus the nonspecific alternatives
$Y \in A^c$. Each test has null type I error and the candidate sets $A \subset \mathcal{Y}$
are arranged in a hierarchy of nested partitions. These tests are then
characterized by scope ($|A|$), power (or type II error) and algorithmic
cost.

We consider sequential testing strategies in which decisions are
made iteratively, based on past outcomes, about which test to perform
next and when to stop testing. The set $\hat{Y}$ is then taken to be the
set of patterns that have not been ruled out by the tests performed.
The total cost of a strategy is the sum of the "testing cost" and the
"postprocessing cost" (proportional to $|\hat{Y}|$) and the corresponding
optimization problem is analyzed. As might be expected, under mild 
assumptions good designs for sequential testing strategies exhibit a 
steady progression from broad scope coupled with low power to high 



Received March 2003; revised June 2004. 

1 Supported in part by a grant from the Humboldt Foundation and the IST Programme
of the European Community under the PASCAL Network of Excellence, IST-2002-506778.

2 Supported in part by ONR Contract N000120210053, ARO Grant DAAD19-02-1-0337
and NSF ITR DMS-02-19016.

AMS 2000 subject classifications. Primary 62H30, 62L05, 68T10; secondary 62H15, 
68T45, 90B40. 

Key words and phrases. Classification, sequential hypothesis testing, hierarchical de- 
signs, coarse-to-fine search, pattern recognition, scene interpretation. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics, 
2005, Vol. 33, No. 3, 1155-1202. This reprint differs from the original in 
pagination and typographic detail. 






G. BLANCHARD AND D. GEMAN 



power coupled with dedication to specific explanations. In the as- 
sumptions ensuring this property a key role is played by the ratio 
cost/power. These ideas are illustrated in the context of detecting 
rectangles amidst clutter. 

1. Introduction. Motivated by problems in machine perception, specif- 
ically scene interpretation, we investigate the theoretical foundations of an 
approach to pattern recognition based on adaptive sequential testing. The 
basic scenario is familiar to everybody — identify one "pattern" (or "explana- 
tion") from among many by posing a sequence of subset questions. In other 
words, play a game of "twenty questions." Intuitively, we should ask more 
and more precise questions, progressing from general ones which "cover" 
many explanations, but are therefore not very selective, to those which 
are highly dedicated and decisive. Although the efficiency of coarse-to-fine 
(CTF) search drives the design of codes and many numerical routines, there 
has been surprisingly little work of a theoretical nature outside information 
theory to understand why this strategy is advantageous. We explore this 
question within the framework of sequential hypothesis testing, putting the 
emphasis on the modeling and optimization of computational cost: In what 
sense and under what assumptions are the strategies which minimize total 
computation CTF? 

Needless to say, in order to have a feasible formulation of the problem one 
must make specific assumptions about the structure of the available tests 
(or "questions"). In this paper, we will therefore consider a particular struc- 
ture based on an a priori multiresolution representation for the individual 
patterns and a corresponding hierarchy of hypothesis tests. Other important 
assumptions concern the statistical distribution of the tests and how cost 
varies with scope and power. 

Our formulation is influenced by applications to pattern recognition, al- 
though we believe it remains sensible for other complex search tasks and we 
would argue that computational efficiency and CTF search are linked in a 
fundamental way. In both natural and artificial systems, many tasks do not 
require immediate, complete explanations of the input data. Nonetheless, 
the usual approach to machine perception is static: Intermediate results, 
when they exist, generally do not provide clear and useful provisional ex- 
planations. In contrast, we consider a sequence of increasingly precise inter- 
pretations (subsets of patterns), noting that experiments in biological vision 
(e.g., studies on "pop-out") report evidence for graded interpretations, for 
example, very fast identification of visual categories [27], "visual selection" 
and "regions of interest" [11]. 

Our formulation is also influenced by what we perceive to be some funda- 
mental limitations in purely learning-based methods in pattern recognition 
in spite of recent advances (e.g., multiple classifiers, boosting and theoretical 
bounds on generalization error). We do not believe that very complex
problems in machine perception, such as full-scale scene interpretation, will yield
directly to improved methods of statistical learning. Some organizational 
framework is needed to confront the sheer number of explanations and com- 
plexity of the data (see, e.g., the discussion in [18]). In our approach learning 
comes into play in actually constructing the individual hypothesis tests from 
training data; in other words, one learns the individual components of an 
overall design. 

The hypothesis-testing framework is as follows. Consider many patterns
(or pattern classes) $y \in \mathcal{Y}$ as well as a special, dominating class 0 which
represents "background." There is one true state $Y \in \{0\} \cup \mathcal{Y}$. In the
highlighted applications, $Y$ refers to a semantic explanation of image data, for
instance, the names and poses (geometrical presentations) of members
belonging to a repertoire of actual objects appearing in an image. Thus, for
example, a "pattern" might be a particular instance of a shape, say a square
at some specific scale and orientation. The explanation $Y = 0$ represents "no
pattern of interest" and is far more likely a priori; class 0 is also far
more varied. Ultimately, we want to determine $Y$ (classification or
identification). Ideally, this task would be accomplished rapidly and without
error.

However, in machine perception and many other domains, near-perfect
classification is often very difficult, even with sizable computational
resources, and virtually impossible without resorting to a "contextual
analysis" of competing explanations. In other words, we eventually need to test
precise hypotheses $Y \in A$ against precise alternatives $Y \in B$, where $A, B \subset \mathcal{Y}$
("is it an apple or a pear?"). In view of the large number of possible
explanations, it is not computationally feasible to anticipate all such scenarios.
This argues for starting, and going as far as one can, with a "noncontextual
analysis," meaning testing the hypothesis $Y \in A$ against the nonspecific
alternative $Y \notin A$ (or, what is often almost the same, against the background
alternative $Y = 0$) for a distinguished family of subsets $A \subset \mathcal{Y}$. Of course
this only makes sense if there are natural groupings of explanations, which
is certainly the case for pattern recognition (e.g., involving real objects and
their spatial presentations).

Let $X_A$ denote the result of such a test, with $X_A = 1$ (resp. $X_A = 0$)
indicating acceptance (resp. rejection). It then makes sense to construct
a family $\mathcal{X}$ of such tests in advance, say of order $O(|\mathcal{Y}|)$. Throughout the
paper we assume that the family $\mathcal{A}$ of sets $A \subset \mathcal{Y}$ for which (noncontextual)
tests are built has a hierarchical, nested cell structure. These sets will be
called attributes and their cardinality called their scope. In this scheme, the
contextual analysis (testing against specific alternatives) begins only
after the number of candidate explanations is greatly reduced, at which point
tests may be created on-line to address the specific ambiguities encountered.
To pin things down, consider a toy example: Suppose $\mathcal{Y} = \{a, p\}$, standing
for apple and pear, and $Y = 0$ stands for other, the most likely explanation.
Suppose also there are four "tests":

(i) $X_{\{a,p\}}$ for testing $Y \in \{a, p\}$ versus $Y = 0$ (something like "Is it a
fruit?");

(ii) $X_{\{a\}}$ (resp. $X_{\{p\}}$) for testing $Y = a$ versus $Y = 0$ (resp. $Y = p$ vs.
$Y = 0$);

(iii) $X_{\{a \text{ vs } p\}}$ for testing $Y = a$ versus $Y = p$.

Tests $X_{\{a,p\}}$, $X_{\{a\}}$, $X_{\{p\}}$ are "noncontextual"; $X_{\{a \text{ vs } p\}}$ is "contextual."
Suppose all noncontextual tests have null false negative error. The type of CTF
strategy that typically emerges from minimizing the "cost" of determining
$Y$ under natural assumptions about how cost, scope and error are balanced
is the intuitively obvious one: Perform $X_{\{a,p\}}$ first; then, if the result is
positive ($X_{\{a,p\}} = 1$), perform $X_{\{a\}}$ and $X_{\{p\}}$; finally, perform $X_{\{a \text{ vs } p\}}$ if both
the previous results are again positive.
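
The strategy just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `X_ap`, `X_a`, `X_p` and `X_a_vs_p` are hypothetical labels for the four tests, each test is modeled as a zero-argument callable returning 0 or 1, and the convention that the contextual test returns 1 for "apple" is chosen for the example rather than taken from the text.

```python
from typing import Callable, Dict, List, Set, Tuple


def ctf_toy_strategy(
    tests: Dict[str, Callable[[], int]]
) -> Tuple[Set[str], List[str]]:
    """Run the coarse-to-fine strategy of the apple/pear toy example.

    `tests` maps hypothetical test names to callables returning 0 or 1.
    Returns the surviving explanations and the names of the tests performed.
    """
    performed: List[str] = []

    def run(name: str) -> int:
        performed.append(name)
        return tests[name]()

    # Coarse test first: "is it a fruit?"
    if run("X_ap") == 0:
        return set(), performed  # background: no pattern of interest

    # Dedicated noncontextual tests, one per pattern.
    survivors = {y for y, name in (("a", "X_a"), ("p", "X_p")) if run(name) == 1}

    # Contextual test only if both candidates survive.
    # Assumed convention: outcome 1 means "apple", 0 means "pear".
    if survivors == {"a", "p"}:
        survivors = {"a"} if run("X_a_vs_p") == 1 else {"p"}
    return survivors, performed


names = ("X_ap", "X_a", "X_p", "X_a_vs_p")

# Background-like data: the coarse test rejects and nothing else is computed.
background_result = ctf_toy_strategy({name: (lambda: 0) for name in names})

# Ambiguous data: every test responds positively, so all four tests run.
ambiguous_result = ctf_toy_strategy({name: (lambda: 1) for name in names})
```

In the background case only one test is evaluated, which is the source of the computational savings when background dominates.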

In this paper we consider efficient designs for the noncontextual phase 
only; the full problem, including contextual disambiguation, will be analyzed 
elsewhere. However, we anticipate the complexity of this contextual analysis 
by incorporating into our measurement of computation a "postprocessing" 
penalty which is proportional to the number of remaining explanations. 

Our objective, then, is efficient "pattern filtering." The reduced set of
explanations after noncontextual testing, denoted by $\hat{Y}$ and called the set
of filtered patterns (or detected patterns), is a random subset of $\mathcal{Y}$ that also
depends on the chosen strategy, that is, the sequence of tests chosen to
be performed. The tests are performed sequentially, and the choice of the
next test to perform (or the decision to stop the search) depends on the
outcomes of the past tests and is prescribed by the strategy. If strategy $T$
has performed the tests $X_{A_1}, \ldots, X_{A_k}$ before terminating (note that $k$ and
$A_1, \ldots, A_k$ are themselves random variables), then the set of filtered patterns
is determined in a simple way from the outcomes of the tests: $\hat{Y}(T)$ consists
of all patterns $y \in \mathcal{Y}$ which are "accepted" by every test $X_{A_i}$ for which
$y \in A_i$, $1 \le i \le k$. In other words, a pattern is said to be filtered if it is not
ruled out by any of the tests performed.

The fundamental constraint is no missed detections: 

$$P(Y \in \hat{Y} \cup \{0\}) = 1.$$

This condition is satisfied if each individual test $X_A$ has zero type I error,
and we make this assumption about every test $X_A$, recognizing that we must
pay for it in terms of cost and power (or equivalently type II error). Although
we shall not be explicitly concerned with standard estimators such as

$$\hat{Y}_{MLE}(\mathbf{X}) = \arg\max_y P(\mathbf{X} \mid Y = y) \quad \text{and} \quad \hat{Y}_{MAP}(\mathbf{X}) = \arg\max_y P(Y = y \mid \mathbf{X}),$$

or even formulate a prior distribution for $Y$, it then follows that

$$P(\hat{Y}_{MLE} \in \hat{Y} \cup \{0\}) = P(\hat{Y}_{MAP} \in \hat{Y} \cup \{0\}) = 1.$$

Tests $X_A \in \mathcal{X}$ are then characterized by their scope ($|A|$), power
[$P(X_A = 0 \mid Y \notin A)$] and computational cost, and certain fundamental trade-offs are
assumed to hold among these quantities. In order to accommodate differing
applications and establish general principles, we will consider several
scenarios, including both "fixed" and "variable" powers and two models,
"power-based" and "usage-based," for how the cost of a test is determined.
Only the power-based cost model will be considered in detail; an analysis
of the usage-based model can be found in [6]. Two other basic assumptions
we make are (i) mean computation is well approximated by conditioning on
$Y = 0$; and (ii), in that case, the tests are conditionally independent.

Except for a concluding illustration, we do not consider how these
hypothesis tests $X_A$ are actually constructed, that is, depend functionally on
the raw data. In the applications cited in Section 8 this typically involves
statistical learning, for instance, inducing a decision tree or support vector
machine from positive ($Y \in A$) and negative ($Y \notin A$) examples. We are
designing the specifications rather than the tests themselves, and modeling the
computational process rather than learning decision boundaries for
classification. Presumably standard techniques can be used to build tests to the
desired specifications if the trade-offs are reasonable. In Section 8 we will
mention one recipe in an image analysis framework.

Although we will assume throughout that the true $Y$ is a single pattern
belonging to $\{0\} \cup \mathcal{Y}$, our analysis would remain valid if we allowed $Y$ to be
an entire subset of patterns $Y \subset \mathcal{Y}$ (with $Y = \emptyset$ representing "no pattern
of interest" or "background"). In this case, $X_A$ would test the hypothesis
$Y \cap A \neq \emptyset$ against $Y \cap A = \emptyset$, or against the nonspecific alternative
$Y = \emptyset$. This setting might be more useful in some applications, such as scene
interpretation, although in the end these subsets are simply more complex
individual explanations.

Finally, our work is a natural outgrowth of an ongoing project on scene
analysis (especially object recognition) which has been largely of an
algorithmic nature (see, e.g., [2]). The current objective is to explore a suitable
mathematical foundation. This was begun in [13] and [14] where the
computational complexity of traversing abstract hierarchies was analyzed in the
context of purely power-based cost, assuming that cost is an increasing,
convex function of power. It was continued in [20], in which the optimality
of depth-first CTF search for background-pattern separation [checking if
$\hat{Y}(\mathbf{X}) = \emptyset$] was established under the same model. The cost model here is
more realistic because cost depends on scope as well as power.
Index of Main Notation

Objects:
  $\mathcal{Y}$                      set of all possible patterns or explanations
  $Y \in \mathcal{Y} \cup \{0\}$     true (data-dependent) pattern (0 means background)
  $P_0(\cdot)$                       $= P(\cdot \mid Y = 0)$, the background distribution

Attributes:
  $A$                                a grouping of objects (a.k.a. attribute)
  $\mathcal{A}$                      hierarchy of attributes
  $\bar{\mathcal{A}}$                "augmented" hierarchy of attributes (see Section 4.5.4)
  $\mathcal{Z}(A)$                   coverings of $A$: $\bigcup_{B \in Z} B = A$ for all $Z \in \mathcal{Z}(A)$
  $A_1$                              coarsest attribute(s); root in the tree-structured case

Tests:
  $X$                                binary random variable
  $\beta(X) \in [0, 1]$              $= P_0(X = 0)$, power of $X$
  $c(X) \in [0, \infty)$             cost of $X$
  $X_{A,\beta}$                      test for attribute $A$ with power $\beta$
  $\mathbf{X}$                       family of tests indexed by $A$; "fixed (powers) hierarchy"
  $\mathcal{X}$                      family of tests indexed by $A, \beta$; "variable-power hierarchy"
  $\beta(A)$; $c(A)$                 power and cost of $X_A$ (fixed hierarchy case)
  $\Gamma$                           increasing, subadditive complexity function for power-based cost
  $\Psi$                             increasing, convex power function for power-based cost;
                                     $\Psi(0) = 0$, $\Psi(1) = 1$

Strategies:
  $T$                                labeled binary tree, $T^\circ$ denotes internal nodes of $T$
  $X(s)$; $A(s)$; $\beta(s)$         test at interior node $s$ of $T$; attribute and power of this test
  $\mathbf{X}(t)$                    set of tests along the branch leading to node $t$ of $T$
  $\hat{Y}(t) \subset \mathcal{Y}$   surviving (filtered) explanations at terminal node $t$ of $T$
  $\hat{Y}(T)$                       filtered set of objects (surviving explanations) after testing
  $q_X(T)$                           probability of performing $X$ in $T$ under $P_0$
  $C(T)$                             $= C_{test}(T) + C_{post}(\hat{Y}(T))$: total cost
  $C_{test}(T)$                      random variable, sum of the costs of the tests performed in $T$
  $C_{post}(\hat{Y}(T))$             $= c^* |\hat{Y}(T)|$, random variable, postprocessing cost
2. Organization of the paper. In Section 3 we provide a nontechnical
overview of the results obtained in the paper. The precise mathematical 
setup appears in Section 4. 

Our principal results appear in Sections 5-7. In Section 5, we consider
the simplest case: There is one single test $X_A$ of fixed power and cost for
each attribute $A \in \mathcal{A}$, and we present a fairly general sufficient condition
under which CTF strategies are optimal. The "variable-power hierarchy"
is examined in Section 6, namely a whole family of tests $(X_{A,\beta})$ for each
attribute $A \in \mathcal{A}$ indexed by their power $\beta$. As the results for variable powers
are decidedly not comprehensive, we attempt to strengthen the case for
the "optimality" of CTF search with a variety of simulations at the end of
Section 6. In Section 7, we mention a few analytical results for a substantially
different cost model in which the cost of a test depends on the frequency
with which it is used; this section amounts to a summary of results in [6].

In order to see how all this plays out in practice, we illustrate a few pre- 
vious applications of this methodology to scene interpretation in Section 8. 
We also sketch an algorithm in Section 8 for a synthetic example of detect- 
ing rectangles in images against a background of "clutter"; the purpose is 
to illustrate in a controlled setting the quantities which figure in our analy- 
sis, especially how computation is measured and tests are constructed from 
data. Finally, in Section 9, we discuss some connections with related work 
and decision trees, critique our results and indicate some directions for future 
research. 

3. Overview of results. A strategy $T$ can be represented as a binary
tree with a test $X \in \mathcal{X}$ at each internal node and a subset $\hat{Y}(t)$ at each
external node or leaf $t$. The computational cost due to testing, $C_{test}(T)$, is
a random variable: the sum of the costs of the tests performed before $\hat{Y}$
is determined. The mean cost is then the average over all tests $X \in \mathcal{X}$ of
the cost of $X$ weighted by the probability that $X$ is performed in $T$; these
quantities will be defined more carefully in Section 4.

In anticipation of resolving the ambiguities in $\hat{Y}$ in order to determine $Y$,
we add to the mean testing cost a quantity which reflects the postprocessing
cost, taken simply as $C_{post}(\hat{Y}(T)) = c^* |\hat{Y}(T)|$, where $c^*$ is a constant called
the unit postprocessing cost. This charge may also be (formally) interpreted
as the cost of performing perfect, albeit costly, tests for each individual
nonbackground explanation in $\hat{Y}$ in order to remove any remaining error
under the background hypothesis [i.e., render $P(\hat{Y} = \emptyset \mid Y = 0) = 1$]. The
constant $c^*$ then represents the cost of a perfect individual test. Again, all
tests have null false negative error, so "perfect" refers to full power.

The natural optimization question is then to find the strategy $T^*$ which
minimizes the mean total computation:

$$T^* = \arg\min_T E[C(T)], \qquad C(T) = C_{test}(T) + C_{post}(\hat{Y}(T)).$$
We are particularly interested in determining when $T^*$ is CTF in scope
(meaning scope is decreasing along any root-to-leaf branch) and CTF in 
power (meaning power increases as scope decreases). Informally, the assump- 
tions we impose are: 

(a) A multiresolution, nested cell representation: The family of attributes
$\mathcal{A}$ has the structure of a tree (see, e.g., Figure 1).

(b) Background domination: Mean computation $E[C(T)]$ and power
$P(X_A = 0 \mid Y \notin A)$ are well approximated by taking $P = P_0 = P(\cdot \mid Y = 0)$.

(c) Conditional independence: Under $P_0$, families of tests over distinct
attributes are independent. This is the strongest assumption and the one
most likely to be violated in practice.

In the case of a fixed-powers hierarchy considered in Section 5, we assume
that the test for attribute $A$ has cost $c(A)$ and power $\beta(A)$. We show that
the ratios $c(A)/\beta(A)$ play a crucial role in the analysis of the optimization
problem, and give the following general sufficient condition: CTF optimality
holds whenever, for any attribute $A$, the ratio of cost to power is less
than the sum of the corresponding ratios over all direct children of $A$ in
the test hierarchy (including if necessary the perfect tests representing the
postprocessing cost, having cost $c^*$ and power 1).
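
The sufficient condition stated above is easy to check mechanically for a given hierarchy. The following is a minimal sketch, not the authors' implementation: the function name and the dictionary-based representation are hypothetical, the inequality is taken to be strict as in the informal statement, and attributes with no children in the hierarchy are assumed to be compared against a single perfect test of ratio $c^*/1$.

```python
from typing import Dict, List


def ctf_condition_holds(
    children: Dict[str, List[str]],
    cost: Dict[str, float],
    power: Dict[str, float],
    c_star: float,
) -> bool:
    """Check c(A)/beta(A) < sum of c(B)/beta(B) over the direct children B of A.

    Attributes without children are compared with a perfect test of cost
    c_star and power 1 (an assumption of this sketch). Powers must be > 0.
    """
    for attr in cost:
        kids = children.get(attr, [])
        if kids:
            rhs = sum(cost[b] / power[b] for b in kids)
        else:
            rhs = c_star / 1.0  # perfect test: cost c_star, power 1
        if not cost[attr] / power[attr] < rhs:
            return False
    return True


# Hypothetical three-attribute hierarchy: a root with two leaf children.
children = {"root": ["left", "right"]}
cost = {"root": 1.0, "left": 2.0, "right": 2.0}
power = {"root": 0.5, "left": 0.8, "right": 0.8}

# Root ratio 2.0 vs. children 2.5 + 2.5; leaf ratio 2.5 vs. c_star.
holds_with_expensive_post = ctf_condition_holds(children, cost, power, c_star=4.0)
holds_with_cheap_post = ctf_condition_holds(children, cost, power, c_star=2.0)
```

When postprocessing is cheap relative to the finest tests, the condition fails at the leaves, which matches the intuition that it may then be preferable to stop testing early.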

In the case of a variable-power hierarchy (Section 6), we consider a
multiplicative model for the cost of $X_{A,\beta}$: $c(X_{A,\beta}) = \Gamma(|A|) \times \Psi(\beta)$, where $\Gamma$ is
subadditive and $\Psi$ is convex. We prove that the CTF strategies always
perform a specific test with the same power and that this power does not depend
on the particular CTF strategy. A rigorous result about CTF optimality is
only obtained for one particular $\Psi$, but simulations strongly indicate that
the observed behavior is more widely true. In summary, CTF strategies seem
to be optimal for a wide range of situations. The same can be said under
the "usage-based" cost model in Section 7.
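
To make the multiplicative cost model concrete, here is a small sketch with one admissible pair of functions. The choices $\Gamma(n) = \sqrt{n}$ and $\Psi(\beta) = \beta^2$ are assumptions made purely for illustration (they are increasing, respectively subadditive and convex, with $\Psi(0) = 0$ and $\Psi(1) = 1$); they are not claimed to be the functions used in the paper.

```python
import math
from typing import Callable


def gamma(scope: int) -> float:
    """Illustrative complexity function: increasing and subadditive."""
    return math.sqrt(scope)


def psi(beta: float) -> float:
    """Illustrative power function: increasing, convex, psi(0)=0, psi(1)=1."""
    return beta ** 2


def multiplicative_cost(scope: int, beta: float) -> float:
    """Cost of a test with the given scope |A| and power beta."""
    return gamma(scope) * psi(beta)


def is_subadditive(f: Callable[[int], float], max_scope: int) -> bool:
    """Numerically check f(m + n) <= f(m) + f(n) for 1 <= m, n <= max_scope."""
    return all(
        f(m + n) <= f(m) + f(n) + 1e-12
        for m in range(1, max_scope + 1)
        for n in range(1, max_scope + 1)
    )


example_cost = multiplicative_cost(scope=4, beta=0.5)   # sqrt(4) * 0.25
sqrt_is_subadditive = is_subadditive(gamma, max_scope=50)
square_is_subadditive = is_subadditive(lambda n: float(n ** 2), max_scope=50)
```

Under such a model a broad, low-power test can be cheaper than a narrow, high-power one, which is the trade-off the optimization has to resolve.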

4. Problem formulation. In this section we formulate efficient pattern 
filtering as an appropriate optimization problem. (Recall that we are using 
the word "pattern" for an "explanation," often quite specific, rather than 
in the sense of some equivalence class of concepts or shapes.) We define 
the fundamental quantities which appear in this formulation, including at- 
tributes, tests and strategies, and how cost is measured both for individual 
tests and for testing designs. We also state our main assumptions about the 
test statistics and the relationships among cost, power and invariance which 
drive the optimization results in Sections 5-7. 

4.1. Goals. The background probability space $\Omega$ represents the raw data
(collections of numerical measurements) and $\mathcal{Y}$ denotes a set of patterns (or
classes or explanations). We imagine the patterns $y \in \mathcal{Y}$ to be rather precise
interpretations of the data and consequently $|\mathcal{Y}|$ to be very large. There is
also a special explanation called background, denoted by 0, which represents
"no pattern of interest" and is typically the most prevalent explanation by
far.

We suppose there is a true state $Y$ which takes values in $\{0\} \cup \mathcal{Y}$ and
which, for simplicity, is determined by the raw data. In other words, we
regard $Y$ as a random variable on $\Omega$. Most of what follows could be generalized
to the case in which $Y \subset \mathcal{Y}$ and $Y = \emptyset$ represents background.

Example. In the context of machine perception, the raw data represent 
signals or images and the explanations represent the presentations of special 
entities, such as words in acoustical signals or physical objects in images 
(e.g., face instantiations or printed characters at a particular font and pose). 
The level of specificity of the explanations is problem-specific. However, we 
do assume that the data have in fact a unique interpretation at the level of 
precision of Y. Clearly this assumption eventually breaks down in the case 
of highly detailed semantic descriptions — at some point the subjectivity of 
the observer cannot be ignored. 

The ultimate goal is pattern identification: Determine Y. However, for the 
reasons stated earlier, we shall focus instead on: 

Pattern filtering. Reduce the set of possible explanations to a relatively
small, data-driven subset $\hat{Y} \subset \mathcal{Y}$ such that $Y \in \hat{Y} \cup \{0\}$ with probability
(almost) 1.

We shall also consider the special case of spotting one single, fixed pat- 
tern y*. A related problem of interest is background-pattern separation or 
background filtering: Determine whether or not Y = 0. Background-pattern 
separation will not be analyzed in this paper since it has been studied else- 
where in a very similar framework in [13, 14, 20]. In contrast, detecting a 
single pattern of interest will often serve as a first step before turning to the 
filtering of all possible patterns. Formally, what will distinguish these tasks 
is only the postprocessing cost; see Section 4.5.2. 

As discussed earlier, the rationale behind pattern filtering is that requiring
that $Y \in \hat{Y} \cup \{0\}$ ensures, by definition, that no pattern is missed. Hence,
the ensuing analysis, which is aimed at determining $Y$ with high precision
and is likely to be computationally intensive, can be limited to $\hat{Y}$. Additional
computation might involve a contextual analysis, such as constructing
hypothesis tests on the fly for distinguishing between competing alternatives
belonging to $\hat{Y}$. This "postprocessing stage" will not be analyzed in this
paper, except that we shall explicitly anticipate additional computation in
the form of a penalty for unfinished business: We impose a "postprocessing
cost" $C_{post}(\hat{Y})$ proportional to the size of $\hat{Y}$. The goal then is to find an
optimal trade-off between the costs of "testing" and "postprocessing."
4.2. Attributes and attribute tests. Any subset of patterns $A \subset \mathcal{Y}$ can be
regarded as an "interpretation" of the data and we assume there are certain
"natural groupings" of this nature (e.g., "writer" in a "Guess Who" version
of twenty questions, "noun" in speech recognition and "character" in visual
recognition). We call these distinguished subsets attributes and we denote
the family of attributes by $\mathcal{A}$ (a collection of subsets of $\mathcal{Y}$) and suppose $|\mathcal{A}|$
is of order $O(|\mathcal{Y}|)$. For every $y \in \mathcal{Y}$, we will assume that

(1)    $\{y\} = \bigcap_{A \ni y} A.$

One of our main assumptions is that $\mathcal{A}$ has a multiresolution, hierarchical
structure with attributes at varying levels of precision. Formally, we assume
that

$$\forall A, A' \in \mathcal{A}, \quad A \cap A' \neq \emptyset \Rightarrow (A' \subset A) \text{ or } (A \subset A').$$

Note that the set of attributes thus has a tree structure (see Figure 1 for an
example). Furthermore, assumption (1) implies that the set of leaves of the
corresponding tree is exactly the set of all singleton attributes.
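
Both structural assumptions on the attribute family can be verified directly for a small example. The sketch below is illustrative only: attributes are represented as Python sets of integer pattern labels, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Set


def is_nested(attributes: List[Set[int]]) -> bool:
    """True if any two attributes are either disjoint or one contains the other."""
    return all(
        not (a & b) or a <= b or b <= a
        for a, b in combinations(attributes, 2)
    )


def separates_patterns(attributes: List[Set[int]], patterns: Iterable[int]) -> bool:
    """Assumption (1): the attributes containing y intersect in exactly {y}."""
    for y in patterns:
        containing = [a for a in attributes if y in a]
        if not containing or set.intersection(*containing) != {y}:
            return False
    return True


patterns = {1, 2, 3, 4}
hierarchy = [{1, 2, 3, 4}, {1, 2}, {3, 4}, {1}, {2}, {3}, {4}]

tree_ok = is_nested(hierarchy) and separates_patterns(hierarchy, patterns)

# Adding an attribute that straddles two cells breaks the nested structure.
overlap_ok = is_nested(hierarchy + [{2, 3}])

# Dropping the singletons violates assumption (1).
coarse_only_ok = separates_patterns([{1, 2, 3, 4}, {1, 2}, {3, 4}], patterns)
```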

For every attribute $A \in \mathcal{A}$ we can build one or more binary tests $X$,
the result of testing the hypothesis $Y \in A$ against either $Y \notin A$ or $Y = 0$;
the value $X = 1$ corresponds to choosing $Y \in A$ and $X = 0$ to choosing
the alternative. Which alternative, $Y \notin A$ or $Y = 0$, is more appropriate
is application-dependent. For example, in inductive learning, the two cases
correspond to the nature of the "negative" examples in the training set:
whether they represent a random sample under $Y \in A^c$ or under
"background." In the applications cited in Section 8, the tests are constructed
based entirely on the statistical properties of the patterns in $A$; neither
alternative is explicitly represented. Due to the domination of the background
class, at least at the beginning of the search, and due to the simplification
afforded by measuring total computation cost under $P_0 = P(\cdot \mid Y = 0)$, the
alternative hypothesis will hereafter be $Y = 0$ and we define the power of
the test accordingly:

$$\beta(X) = P(X = 0 \mid Y = 0).$$

Fig. 1. Example of a (nonregular) tree-structured hierarchy of attributes.

In order to make the notation more informative, we shall write either $X_A$
to indicate the attribute being tested or $X_{A,\beta}$ to signal both the attribute
and the power.

The first main assumption we will make about these tests is that their false
negative rate is negligible. In other words, if a pattern (i.e., nonbackground
explanation) is present, then any attribute test which covers this pattern
must respond positively:

(2)    $P(X_A = 1 \mid Y \in A) = 1 \quad \forall A \in \mathcal{A}.$

For this reason, and due to the origins of this work in visual object
recognition, we sometimes refer to the size of $A$ as the level of invariance of $X_{A,\beta}$,
but usually just as the scope. Its depth in the attribute hierarchy is called
the level of resolution. In general, however,

$$P(X_A = 1 \mid Y = y) > 0 \quad \forall y \in \{0\} \cup \mathcal{Y} \setminus A.$$

In other words, the tests are usually not perfect or two-sided invariants.

Formally, assumption (2) is not necessary for the mathematical results in
the coming sections to hold, because we will only make computations under
the "background probability" (when $Y = 0$); see Section 4.4. However, this
assumption is necessary for our formulation of pattern identification to make
sense; indeed, it implies that if one has performed tests $X_{A_1}, \ldots, X_{A_k}$, then
necessarily

$$Y \in \mathcal{Y} \setminus \bigcup_{i : X_{A_i} = 0} A_i.$$

We will say that the patterns above have been filtered by tests $X_{A_1}, \ldots, X_{A_k}$
and focus on sequential testing designs for which the chosen $\hat{Y}$ is the set of
patterns filtered by all the tests actually performed (called filtered patterns
for short). This choice coheres with our requirement that $Y \in \hat{Y} \cup \{0\}$ with
probability 1, while at the same time ensuring that $\hat{Y}$ is of minimum size
given the available information.
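
The set of filtered patterns is simply the complement of the union of all rejected attributes. A minimal sketch follows, assuming that attributes are represented as frozensets of hypothetical integer pattern labels and that the outcomes of the tests actually performed are collected in a dictionary.

```python
from typing import Dict, FrozenSet, Iterable, Set


def filtered_patterns(
    patterns: Iterable[int],
    outcomes: Dict[FrozenSet[int], int],
) -> Set[int]:
    """Return the patterns not ruled out by any performed test.

    `outcomes` maps each tested attribute A to the observed value of X_A
    (1 = accepted, 0 = rejected). Untested attributes are simply absent.
    """
    rejected: Set[int] = set()
    for attribute, value in outcomes.items():
        if value == 0:
            rejected |= attribute
    return set(patterns) - rejected


outcomes = {
    frozenset({1, 2, 3, 4}): 1,  # coarse test accepted
    frozenset({1, 2}): 0,        # this cell is rejected: rules out 1 and 2
    frozenset({3}): 1,           # pattern 3 accepted; pattern 4 never tested
}
survivors = filtered_patterns({1, 2, 3, 4}, outcomes)
```

Note that pattern 4 survives although its dedicated test was never run: a pattern is only removed by an explicit rejection.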

Finally, each test $X_{A,\beta}$ has a cost or complexity $c(X_{A,\beta})$ which represents
the amount of online computation (or time) necessary to evaluate $X_{A,\beta}$. In
Section 4.6 we shall consider a cost model in which the cost of a test is a 
predetermined quantity related to power and scope. In Section 7 we briefly 
consider another "usage-based" cost model. 






4.3. Test hierarchies. We consider two types of families of tests, one with
exactly one test (at some fixed power) per attribute and referred to as a fixed
test hierarchy, and one with a one-parameter family of tests $\{X_{A,\beta}, 0 \le \beta \le 1\}$
for each $A \in \mathcal{A}$ indexed by power and referred to as a variable-power
hierarchy.

4.3.1. Fixed hierarchy. We will denote such a hierarchy by $\mathbf{X} = \{X_A, A \in \mathcal{A}\}$
and write $\beta(A)$ for the power of $X_A$ and $c(A)$ for its cost. Optimal testing
strategies for fixed hierarchies are the subject of Sections 5 and 7 for
two different cost models. In the analysis in those sections a central role is
played by the (random) set $\hat{Y}(\mathbf{X})$ of patterns which are filtered by all the
tests in $\mathbf{X}$, that is, those patterns which are verified at all levels of resolution.
More precisely:

$$\hat{Y}(\mathbf{X}) = \mathcal{Y} \setminus \bigcup \{A \in \mathcal{A} \mid X_A = 0\}.$$

Recall that under our constraint on the false negative error, we necessarily
have $P(Y \in \{0\} \cup \hat{Y}(\mathbf{X})) = 1$. Clearly, $\hat{Y}(\mathbf{X})$ leads to a smaller postprocessing
cost than any $\hat{Y}$ based on only some of the tests in the hierarchy, but, of
course, requires more computation to evaluate in general.

4.3.2. Variable-power hierarchy. The variable-power hierarchy is

$$\mathcal{X} = \{X_{A,\beta} \mid A \in \mathcal{A}, \beta \in [0, 1]\}.$$

In Section 6 we will consider testing strategies in which, at each step in a 
sequential procedure, both an attribute and a power may be selected. This 
clearly leads to a more complex optimization problem and our results in this 
direction are correspondingly far less complete than those in the case of a 
fixed hierarchy. From another point of view, extracting a subset of tests from 
a variable-power hierarchy (e.g., specifying a testing strategy) is a type of 
model selection problem. 

4.4. The probabilistic model. In order for the upcoming optimization
problems to be well defined, we need to specify the joint distribution of
the random variables in $\mathcal{X}$.

The first hypothesis we make is that we are going to measure mean
computation relative to $P_0(\cdot) = P(\cdot \mid Y = 0)$, the "background distribution." This
is justified by the assumption that a priori the probability of the
explanation $Y = 0$ is far greater than that of the compound alternative $Y \neq 0$, let alone any
single, nonbackground explanation. For instance, in visual processing a
randomly selected subimage is very unlikely to support a precise explanation
in terms of visible patterns; in other words, most of the time all we observe
is clutter.
The second hypothesis we make is that, under $P_0$, any family of tests
$X_{A_1,\beta_1}, \ldots, X_{A_k,\beta_k}$ for distinct attributes $A_1, \ldots, A_k$ is independent. This is
probably the strongest assumption in this paper but is not altogether
unreasonable under $P_0$ in view of the structure of $\mathcal{A}$ since two distinct tests are
either testing for disjoint attributes (if they are at the same level of
resolution) or testing for attributes at different levels of resolution. In Section 5 we
shall briefly consider simulations for a nontrivial dependency structure, a
Markov hierarchy.

No assumptions are made about the dependency structure among tests 
for the same attribute but at different powers. Instead, the assumption to 
be made in the following section that no attribute can be tested twice in the 
same procedure allows us to compare the cost of testing strategies regardless 
of this dependency structure. 

4.5. Testing strategies and their cost. We consider sequential testing pro- 
cesses, where tests are performed one after another and the choice of the 
next test to be performed (or the decision to stop the testing process) can 
depend on the outcomes of the previously performed tests. We will make 
the important assumption that in any sequence of tests, a given attribute 
can only be tested once. 

Definition 1 (Testing strategy). A strategy is a finite labeled binary
tree $T$ where each internal node $t \in T^\circ$ is labeled by a test $X(t) = X_{A(t),\beta(t)}$
and where $A(t) \neq A(s)$ for any two nodes $t, s$ along the same branch. At
each internal node $t$ the right branch corresponds to $X(t) = 1$ and the left
branch to $X(t) = 0$.

The restriction to at most one test per attribute $A$ along any given branch,
while of course automatically satisfied in the case of a fixed hierarchy
(Sections 5 and 7), does limit the set of possible strategies for a variable-power
hierarchy since several tests $X_{A,\beta}$ of varying power are available for
each attribute $A$. In that case the purpose of this assumption is essentially
to simplify the analysis by guaranteeing that all the tests actually performed
are independent.

The leaves (terminal nodes) of $T$ will be labeled in accordance with the
answers to the tests: Every leaf of $T$ is labeled by the subset $\hat{Y} \subset \mathcal{Y}$ of filtered
patterns that have not been ruled out by the tests performed by the strategy
(along the branch leading to this leaf). In other words, for any strategy $T$
and leaf $s$ of $T$, if $X(s)$ denotes the set of tests along the branch leading to
$s$, we put

$$\hat{Y}(s) = \mathcal{Y} \setminus \bigcup \{A \in \mathcal{A} \mid X_A \in X(s);\ X_A = 0\}.$$

The random set $\hat{Y}(T)$ is then defined by interpreting $T$ as a function of
the tests which takes values among its leaves. However, how the leaves are
labeled is irrelevant for the purposes of defining the testing cost $C_{test}(T)$ of
a strategy; it will only influence the postprocessing cost $C_{post}(\hat{Y})$.

4.5.1. Cost of testing. There are several equivalent definitions of the test- 
ing cost of T, another random variable. One is 

$$C_{test}(T) = \sum_{t \in T^\circ} c(X(t)) \mathbf{1}_{H_t},$$

where $H_t$ is the history of node $t$, that is, the event that $t$ is reached. Recall that $T^\circ$
is the set of internal nodes of $T$. This is clearly the same as aggregating the
costs over the branch traversed or adding the costs of all tests performed.
Given a probability distribution $P$ on $\Omega$, and in particular $P = P_0$, two
equivalent expressions for the mean cost are then

(3)    $E_0[C_{test}(T)] = \sum_{t \in T^\circ} c(X(t)) P_0(H_t) = \sum_{X} c(X) q_X(T),$

where

$$q_X(T) = P_0(X \text{ performed in } T) = \sum_{t \in T^\circ} \mathbf{1}_{\{X(t) = X\}} P_0(H_t).$$

Expression (3) is particularly useful in proving some of our results; in Sec- 
tion 5 we will transform it into yet another expression that will anchor the 
analysis there. 
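
The two expressions for the mean testing cost can be checked on a small example. The sketch below assumes the independence hypothesis of Section 4.4, so that the probability of reaching a node is a product of powers and their complements along the branch. The tuple encoding of a strategy, the function names and the numerical values are illustrative assumptions, not part of the original text.

```python
# A strategy is encoded as a nested tuple
#   (test name, subtree if the answer is 0, subtree if the answer is 1),
# with None marking a leaf. power[name] is the probability of answer 0 under P0.

def cost_by_nodes(tree, cost, power, reach=1.0):
    """Sum of c(X(t)) * P0(H_t) over the internal nodes t of the tree."""
    if tree is None:
        return 0.0
    name, if_zero, if_one = tree
    return (reach * cost[name]
            + cost_by_nodes(if_zero, cost, power, reach * power[name])
            + cost_by_nodes(if_one, cost, power, reach * (1.0 - power[name])))


def performance_probabilities(tree, power, reach=1.0, q=None):
    """Return q[name] = P0(test `name` is performed), summed over the nodes using it."""
    q = {} if q is None else q
    if tree is not None:
        name, if_zero, if_one = tree
        q[name] = q.get(name, 0.0) + reach
        performance_probabilities(if_zero, power, reach * power[name], q)
        performance_probabilities(if_one, power, reach * (1.0 - power[name]), q)
    return q


# Hypothetical two-test vine: A2 is performed only if A1 answers 1.
cost = {"A1": 1.0, "A2": 2.0}
power = {"A1": 0.5, "A2": 0.8}
vine = ("A1", None, ("A2", None, None))

by_nodes = cost_by_nodes(vine, cost, power)
q = performance_probabilities(vine, power)
by_tests = sum(cost[name] * q[name] for name in q)
print(by_nodes, by_tests)  # 2.0 2.0
```

Both sums give the same value because each internal node contributes its cost weighted by the probability of being reached, whether the terms are grouped by node or by test.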

4.5.2. Cost of postprocessing. It is natural to define the postprocessing 
cost in the following, goal-dependent manner: 

(i) Filtering a special pattern: $C_{post}(\hat{Y}(T)) = c^* \mathbf{1}_{\{y^* \in \hat{Y}(T)\}}$, where $y^*$ is
the target pattern.

(ii) Filtering all patterns: $C_{post}(\hat{Y}(T)) = c^* |\hat{Y}(T)|$.

Here $c^*$ is some constant called the unit postprocessing cost.

In the case of a single target pattern, note that this choice of postprocessing
cost naturally leads us to disregard any attribute not containing the
target $y^*$, as those tests are irrelevant to the goal at hand and can only augment
the total cost. Consequently, the set of relevant attributes reduces to
a "vine" $A_1 \supset A_2 \supset \cdots \supset A_L$. In this case, choosing a testing strategy boils
down to choosing a subset of these relevant attributes and an order in which
to test for them. If a test returns a null answer, the search terminates with
the outcome $y^* \notin \hat{Y}$ and there is no postprocessing charge; on the other
hand, if all the selected tests respond positively, then $y^* \in \hat{Y}$ is declared
(which still may not be true) and the charge is $c^*$. In particular, the testing
strategy $T$ itself has in this case the structure of a vine (see Figure 2). In
contrast, in the case of general pattern filtering the testing strategies are of
course tree-structured.

Fig. 2. Left: A vine-structured hierarchy of attributes for detecting one pattern. Right:
An example of a vine-structured testing strategy for this hierarchy.

4.5.3. Optimization problem. The total computational cost for the task
at hand is $C_{test}(T) + C_{post}(\hat{Y}(T))$. The corresponding optimization problem,
our central focus, is then to find a strategy attaining

(4)    $\min_{T \in \mathcal{T}} \left( E_0[C_{test}(T)] + E_0[C_{post}(\hat{Y}(T))] \right),$

where $\mathcal{T}$ is the family of all strategies. We emphasize that in the case of
variable-power hierarchies we are therefore optimizing over both power and
scope.

4.5.4. Equivalent model with perfect tests. There is an equivalent way
to interpret the postprocessing cost which is technically more convenient.
We can think of $c^*$ as the cost of performing a perfect test (i.e., without
errors under $P_0$) for any individual pattern. Therefore, the postprocessing
cost model is formally equivalent to supposing there is no postprocessing
stage, but that no errors (under $P_0$) are allowed at the end of the procedure,
enforced by performing, as needed, some additional perfect tests at
the end of the search. Since we have assumed that no attribute, and in particular
no singleton $\{y\}$, can be tested at two different powers along the
same branch, we can incorporate perfect testing into the previous framework
simply by adding a final layer to the original hierarchy $\mathcal{A}$ which copies
the original leaves, thereby accommodating a battery of perfect singleton
tests having cost $c^*$. (Conditional independence is actually maintained since
the new tests are deterministic under $P_0$.) We denote by $\bar{\mathcal{A}}$ the resulting
augmented hierarchy. [Due to this augmentation there is a slight abuse of
notation when identifying an attribute with a subset of $\mathcal{Y}$, since in the
augmented hierarchy we would like (in order to be entirely consistent) to
consider some attributes as distinct although they correspond to the same
set $\{y\}$. However, we will stick to the notation introduced before in order to
avoid cumbersome changes.]

This formal construction allows us to include the postprocessing cost in 
the testing framework. Furthermore, in the augmented model it is not diffi- 
cult to show that for any strategy T there exists a strategy T' performing 
exactly the same tests, but with the perfect tests performed at the end only, 
so that the optimization problem is in fact unchanged by allowing the perfect 
tests to be performed at any time. In summary, the equivalent optimization
problem is to minimize the amount of computation necessary to achieve no
error under $P_0$ based on the augmented hierarchy.

4.6. Cost of a test. There are certain natural trade-offs among cost,
power and invariance:

(a) At a given cost, power should be a decreasing function of invariance. 

(b) At a given power, cost should be an increasing function of invariance. 

(c) At a given invariance, cost should be an increasing function of power. 

In Section 5, we will first deal with a generic setting where the test associated
with a given attribute $A$ has power $\beta(A)$ and cost $c(A)$. In Section 6 we
will use a more specific model reflecting the trade-offs among cost, power
and invariance mentioned above:

(5)    $c(X_{A,\beta}) = \Gamma(|A|) \times \Psi(\beta),$

where the complexity function $\Gamma$ is subadditive and the power function $\Psi$
is convex. Consequently, we evaluate the cost of a test much like the merit
of a dive in the Olympics: at any given level of difficulty ($\Gamma$) a score ($\Psi$)
is assigned based on performance alone. For normalization, we can assume
that $\Gamma(1) = 1$. Then, with the equivalent model where the postprocessing cost
is replaced by "perfect" tests in mind, it is consistent to assume $c^* = \Psi(1)$.
This multiplicative model is supported (at least roughly) by what is observed
in actual experiments (see Section 8).

One special case, treated in Section 6, is $\Gamma(n) = n$, that is, the complexity
is simply the level of invariance. This case is the least favorable to CTF
strategies since, in effect, no "credit" is given for shared properties among
two disjoint attributes $A, B \in \mathcal{A}$. If, for instance, $|A| = |B|$ with $A, B$ disjoint,
a test for $A \cup B$ at a given power $\beta$ has the same cost as testing separately
for both $A$ and $B$ at power $\beta$.
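
The multiplicative model (5) in the linear case can be sketched in a few lines. The complexity function follows the special case $\Gamma(n) = n$ above; the power function $\Psi(\beta) = \beta^2$ is a hypothetical convex choice made only for this example, and the function names are likewise illustrative.

```python
def gamma(scope: int) -> float:
    """Complexity function in the linear special case, Gamma(n) = n."""
    return float(scope)


def psi(beta: float) -> float:
    """Hypothetical convex power function (an assumption for illustration)."""
    return beta ** 2


def attribute_test_cost(scope: int, beta: float) -> float:
    """Cost of a test with the given scope |A| and power beta under model (5)."""
    return gamma(scope) * psi(beta)


# Normalization: Gamma(1) = 1, so the unit postprocessing cost is c* = Psi(1).
unit_post_cost = attribute_test_cost(1, 1.0)

# Two disjoint attributes of size 2 versus their union of size 4, at the same power.
separate = attribute_test_cost(2, 0.5) + attribute_test_cost(2, 0.5)
merged = attribute_test_cost(4, 0.5)
print(unit_post_cost, separate, merged)  # 1.0 1.0 1.0
```

With a strictly subadditive $\Gamma$ the merged test would be cheaper than the two separate ones, which is the "credit" for shared properties that the linear case withholds.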

A particular case, treated in [13] and [20], in the setting of a fixed hierarchy,
is to assume $c(X_{A,\beta_A}) = \Psi(\beta_A)$ for some function $\Psi$. The model
considered here is more general.

Fig. 3. Example of typical CTF strategies. Left: breadth-first; right: depth-first.

4.7. Special strategies. In the following sections our main goal will be 
to determine under what additional hypotheses the optimal strategies are 
"coarse-to-fine" (CTF). 

Definition 2 (Coarse-to-fine). A strategy $T \in \mathcal{T}$ is CTF in resolution,
or just CTF, if an attribute is tested if and only if each of its ancestors has
already been tested and returned a positive answer. A strategy $T \in \mathcal{T}$ is
CTF in power if, for any two nodes $s, t$ along the same branch, $\beta(s) > \beta(t)$
whenever $A(s) \subset A(t)$.

In the case of filtering a single pattern, this simply means that a CTF
strategy performs all the relevant tests in the order of increasing resolution,
that is, $X_{A_1}, \ldots, X_{A_L}$. For general pattern filtering, several different
strategies have the CTF property, for instance, "breadth-first" and "depth-first"
search. In Figure 3 these two CTF strategies are illustrated in the
case of a hierarchy of depth $L = 5$ and test outcomes such that $\hat{Y}(T_{ctf}) = \emptyset$,
that is, no patterns are verified at all resolutions due to the "null covering"
$\{X_{3,1} = 0, X_{4,3} = 0, X_{4,4} = 0, X_{4,5} = 0, X_{4,6} = 0, X_{3,4} = 0\}$ (writing $X_{l,k}$ for
the $k$th test at depth $l$). Notice that the breadth-first CTF strategy has the
nice feature that the tests are always performed in the order of nondecreasing
depth.

For a fixed hierarchy, all CTF strategies for pattern filtering perform 
exactly the same tests (although perhaps not in the same order). Whatever 
the order chosen, in the end, along any branch of the attribute hierarchy, 
every test has been performed starting from the root until the first null 
answer encountered on this branch. It is therefore possible to speak of "the" 
CTF strategy, it being understood that the precise order in which the tests 
are performed does not affect the mean cost. 

Note. Although we do not consider the problem of separating patterns
from background in and of itself (as in [20]), it is interesting to observe that
the situation is more complex in that case since not all CTF strategies are
equivalent. Indeed, in any optimal strategy, testing stops as soon as any
complete "1-chain" is found and, consequently, depth-first CTF strategies
are generally optimal, as shown in [20].

The probability of performing a test $X_A$ in a CTF strategy for a fixed
hierarchy has a simple expression:

$$q_A(T) = P_0(X_A \text{ performed in } T) = P_0(X_B = 1 \text{ for all } B \supset A,\ B \neq A) = \prod_{B \supsetneq A} (1 - \beta(B)).$$

Moreover, under the CTF strategy $\hat{Y}$ minimizes $C_{post}(\hat{Y}(T))$, and, in fact,

(6)    $\hat{Y}(T_{ctf}) = \hat{Y}(\mathcal{X})$ a.s.,

which can also be identified with the set of all "1-chains" in the hierarchy.
It follows that the total mean cost of the CTF strategy is then given by

$$E_0[C(T_{ctf})] = \sum_{A \in \mathcal{A}} c(A) \prod_{B \supsetneq A} (1 - \beta(B)) + c^* E_0[|\hat{Y}(\mathcal{X})|].$$
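
The mean cost of the CTF strategy can be evaluated directly from the powers and costs. The sketch below assumes the independence hypothesis under $P_0$, in which case the expected number of detected patterns is the sum over patterns $y$ of the product of $(1 - \beta(A))$ over all attributes $A$ containing $y$; this expression is a consequence worked out for the example rather than a formula quoted from the text. The dictionary encoding of the hierarchy and all numerical values are hypothetical.

```python
from math import prod


def ctf_mean_cost(parent, members, cost, power, unit_post_cost):
    """Return (q, total mean cost) of the CTF strategy on a fixed hierarchy.

    parent[a]  : name of the parent attribute of a (None for the root)
    members[a] : frozenset of patterns contained in attribute a
    power[a]   : probability under P0 that the test for a returns 0
    """
    def strict_ancestors(a):
        chain = []
        while parent[a] is not None:
            a = parent[a]
            chain.append(a)
        return chain

    # q_A: every strict ancestor of A must have answered 1.
    q = {a: prod(1.0 - power[b] for b in strict_ancestors(a)) for a in parent}
    testing = sum(cost[a] * q[a] for a in parent)

    # A pattern is detected iff every attribute containing it answers 1.
    patterns = set().union(*members.values())
    expected_detected = sum(
        prod(1.0 - power[a] for a in parent if y in members[a]) for y in patterns
    )
    return q, testing + unit_post_cost * expected_detected


# Hypothetical depth-2 hierarchy on two patterns.
parent = {"root": None, "left": "root", "right": "root"}
members = {"root": frozenset({1, 2}), "left": frozenset({1}), "right": frozenset({2})}
cost = {"root": 1.0, "left": 1.0, "right": 1.0}
power = {"root": 0.5, "left": 0.5, "right": 0.5}

q, total = ctf_mean_cost(parent, members, cost, power, unit_post_cost=4.0)
print(q, total)  # {'root': 1.0, 'left': 0.5, 'right': 0.5} 4.0
```

Here the testing part contributes 2.0 and the postprocessing part 4.0 x 0.5 = 2.0.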

Still in the case of a fixed hierarchy, it will be useful to delineate all 
strategies with property (6). 

Definition 3 (Complete strategies). A strategy $T \in \mathcal{T}$ is complete if
$\hat{Y}(T) = \hat{Y}(\mathcal{X})$. The family of complete strategies is denoted by $\bar{\mathcal{T}}$.

Remark. Under the hypotheses we have made, for a complete strategy
it is possible to compute explicitly the probability of error under the null
hypothesis before the postprocessing step, that is, to calculate $P_0(\hat{Y} \neq \emptyset)$.
(This is the probability that at least one nonnull pattern is detected when
only background clutter is actually observed.) For single-pattern detection
it is just the probability under $P_0$ that all the tests along the vine respond
positively: $P_0(\hat{Y} \neq \emptyset) = \prod_{k=1}^{L} (1 - \beta_k)$ [where $\beta_k = \beta(A_k) = P_0(X_{A_k} = 0)$]; for
detection of all possible patterns it is exactly the probability that there exists
a "1-chain" leading from the root of the attribute tree to one of its leaves.
Given the independence assumption on the tests under $P_0$, this in turn
is exactly the probability of nonextinction of an inhomogeneous branching
process at generation $L$, which can be computed explicitly once the branching
probabilities [i.e., $\beta(A)$, $A \in \mathcal{A}$] are known.
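
One way to carry out this computation is a bottom-up recursion over the attribute tree: a 1-chain passes through an attribute if its own test answers 1 and at least one child carries a 1-chain to a leaf. The sketch below is an illustration under the independence assumption; the tuple encoding `(power, children)` and the function name are assumptions made for this example.

```python
def detection_probability(node):
    """P0 that a 1-chain runs from this attribute down to a leaf.

    node = (power, children), where power is the probability that the
    attribute's test returns 0 and children is a list of nodes.
    """
    power, children = node
    if not children:
        return 1.0 - power
    no_child_survives = 1.0
    for child in children:
        no_child_survives *= 1.0 - detection_probability(child)
    return (1.0 - power) * (1.0 - no_child_survives)


# Hypothetical depth-2 dyadic hierarchy with all powers equal to 0.5.
dyadic = (0.5, [(0.5, []), (0.5, [])])
print(detection_probability(dyadic))  # 0.375

# On a vine the recursion reduces to the product of the (1 - beta_k).
vine = (0.5, [(0.5, [])])
print(detection_probability(vine))  # 0.25
```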

Finally, for a variable-power hierarchy $\mathcal{X}$ there are many different, noncost-equivalent
CTF strategies depending on the powers chosen for the tests along
each branch. Nonetheless, surprisingly, the optimal CTF strategy can sometimes
be precisely characterized, being CTF in power with, in fact, a unique
power assigned to each attribute (see Section 6).

5. Optimal strategies for fixed costs and powers. Throughout this section
we assume a fixed test hierarchy $\mathcal{X} = \{X_A, A \in \mathcal{A}\}$ and we write $c(A), \beta(A)$
for the cost and power, respectively, of $X_A$. We will then refer to "testing an
attribute $A$" or "attaching an attribute" to a node of $T$ without ambiguity.
Our goal is to identify conditions (trade-offs) involving $\{c(A), \beta(A), A \in \mathcal{A}\}$
under which optimal strategies may be characterized.

For parts of this section it will be easier to actually consider the equivalent
model with perfect tests in lieu of the postprocessing cost, as described in
Section 4.5.4. From here on, $\bar{\mathcal{A}}$ will denote the augmented hierarchy, and the
considered strategies $T$ for $\bar{\mathcal{A}}$ will satisfy the no-error constraint. In other
words, in the augmented model, when the strategy ends all patterns $y \in \mathcal{Y}$
must have been covered by at least one test which has been performed and
returned 0 (again, it may be one of the perfect, artificial tests representing
postprocessing). We start this section with a fundamental formula for the
average cost $E_0[C(T)]$ that will be useful for all of the results to follow.

5.1. Reformulation of the cost. As just pointed out, in the augmented 
hierarchy model strategies must find a way to "cover" all patterns with 
attributes whose associated test is negative. Therefore the notion of covering 
will play a central role in the analysis to come, motivating the following 
definitions: 

Definition 4 (Covering). A set of attributes $Z \subset \bar{\mathcal{A}}$ is a covering if

$$\bigcup \{A : A \in Z\} = \mathcal{Y}.$$

The set of coverings for the augmented hierarchy $\bar{\mathcal{A}}$ is denoted $Z(\bar{\mathcal{A}})$.

Definition 5 (Tested attributes). For a given strategy $T$, denote by
$X(T)$ the (random) set of attributes tested by $T$, and by $X_0(T)$ the set of
attributes in $X(T)$ for which the corresponding test returned the answer 0,
called the zero set of $T$.

Of course, the no-error constraint for a strategy $T$ now reads simply:
$X_0(T)$ is (a.s.) a covering. We now turn to an important formula:

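Lemma 1 (Cost reformulation). For any (no-error) strategy $T$ for the
augmented hierarchy $\bar{\mathcal{A}}$,

(7)    $E_0[C(T)] = \sum_{Z \in Z(\bar{\mathcal{A}})} P_0(X_0(T) = Z) \sum_{A \in Z} \frac{c(A)}{\beta(A)}.$

Proof. For any attribute $A \in \bar{\mathcal{A}}$, let $\chi_A(T) = P_0(A \in X_0(T))$ and let
$q_A(T) = P_0(X_A \text{ performed by } T)$. Note that we have two useful expressions
for $\chi_A(T)$,

(8)    $\chi_A(T) = \sum_{Z \in Z(\bar{\mathcal{A}}),\ Z \ni A} P_0(X_0(T) = Z)$

and

(9)    $\chi_A(T) = P_0(A \in X(T),\ X_A = 0) = P_0(A \in X(T)) P_0(X_A = 0) = q_A(T) \beta(A),$

where the second equality comes from the fact that the event that $A$ is
performed by $T$ only depends on the values of tests for other attributes, and
is thus independent of $X_A$ by the independence assumption.
Now recalling expression (3) we have

$$E_0[C(T)] = \sum_{A \in \bar{\mathcal{A}}} c(A) q_A(T) = \sum_{A \in \bar{\mathcal{A}}} \frac{c(A)}{\beta(A)} \chi_A(T) = \sum_{A \in \bar{\mathcal{A}}} \frac{c(A)}{\beta(A)} \sum_{Z \in Z(\bar{\mathcal{A}}),\ Z \ni A} P_0(X_0(T) = Z),$$

which is (7) after exchanging the order of summation. □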

This lemma combines two straightforward observations. First, the cost
"generated" by a specific attribute $A$ using strategy $T$ can be written as

(10)    $c(A) P_0(A \in X(T)) = \frac{c(A)}{\beta(A)} P_0(A \in X(T)) P_0(X_A = 0) = \frac{c(A)}{\beta(A)} P_0(A \in X_0(T)).$

Second, the sum over attributes of the last expression can be reformulated
as a sum over coverings (using the no-error property). Note in particular
that (10) has the following interpretation: As far as average cost is concerned,
it is equivalent to (i) pay the cost $c(A)$ every time test $X_A$ is performed,
or (ii) pay the cost $c(A)/\beta(A)$ when $X_A$ is performed and returns the answer 0
but pay nothing otherwise.

Note also that the lemma implies that the average cost $E_0[C(T)]$ is
a convex combination of the quantities $\sum_{A \in Z} c(A)/\beta(A)$ for $Z \in Z(\bar{\mathcal{A}})$.
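
The cost reformulation can be verified numerically by enumerating the leaves of a small strategy. The sketch below uses a hypothetical augmented depth-2 hierarchy (root `A1`, children `B1` and `B2`, perfect singleton tests `P1` and `P2` with power 1 and cost $c^* = 4$) and a CTF-like strategy; the tuple encoding, names and numbers are assumptions for illustration. The direct mean cost charges every test on the branch, whereas the reformulated cost charges $c(A)/\beta(A)$ only for the tests that answered 0, that is, for the zero set reached at the leaf.

```python
def leaves(tree, power, prob=1.0, path=()):
    """Yield (probability, path) for every leaf; path is a tuple of (test, answer)."""
    if tree is None:
        yield prob, path
        return
    name, if_zero, if_one = tree
    yield from leaves(if_zero, power, prob * power[name], path + ((name, 0),))
    yield from leaves(if_one, power, prob * (1.0 - power[name]), path + ((name, 1),))


cost = {"A1": 1.0, "B1": 1.0, "B2": 1.0, "P1": 4.0, "P2": 4.0}
power = {"A1": 0.6, "B1": 0.5, "B2": 0.25, "P1": 1.0, "P2": 1.0}

# Test B2, then the perfect test for pattern 2 if B2 answered 1.
b2_branch = ("B2", None, ("P2", None, None))
# Test A1; stop on 0. Otherwise test B1 (with its perfect test if needed), then B2.
strategy = ("A1", None, ("B1", b2_branch, ("P1", b2_branch, None)))

direct = sum(p * sum(cost[name] for name, _ in path)
             for p, path in leaves(strategy, power))
reformulated = sum(p * sum(cost[name] / power[name] for name, answer in path if answer == 0)
                   for p, path in leaves(strategy, power))
print(round(direct, 10), round(reformulated, 10))  # 3.8 3.8
```

Every leaf reached with positive probability has a zero set that covers both patterns, for instance `{A1}`, `{B1, B2}` or `{P1, P2}`, so the weights in the second sum are exactly the covering probabilities of the lemma.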

5.2. Filtering one special pattern. Recall this corresponds to the case
where the set of attributes has the structure of a vine (see Figure 2). We
can imagine two broad scenarios: In one case, there is really only one pattern
of interest, and hence no issue of invariance other than guaranteeing that
every test is positive whenever $Y = y^*$. Imagine, for example, constructing a
sequence of increasingly precise "templates" for a given shape, in which case
both power and cost would typically increase with precision. In another scenario,
one could imagine utilizing a hierarchy of tests originally constructed
for multiple patterns in order to check for the presence of a single pattern $y^*$.
Clearly, only one particular branch of the hierarchy is then relevant, namely
the branch along which all the attributes contain $y^*$. Obviously, such tests
would typically be less dedicated to $y^*$ than in the first scenario, except at
the final level. In either case the natural framework is a sequence of tests,
say $X_\ell$ for attributes $A_\ell$, with costs $c_\ell$ and powers $\beta_\ell$ for $\ell = 1, \ldots, L$, and
the natural background measure is conditional on $Y \neq y^*$. Also, it is simpler
here to consider the augmented hierarchy setting, so that we assume that
there is a test at level $L + 1$ with $\beta_{L+1} = 1$, $c_{L+1} = c^*$.

The important quantity is the cost normalized by the power, $\{c_\ell / \beta_\ell\}$. Let
$\pi(\ell)$, $\ell = 1, \ldots, L + 1$, denote the ordering of these ratios,

(11)    $\frac{c_{\pi(1)}}{\beta_{\pi(1)}} \le \frac{c_{\pi(2)}}{\beta_{\pi(2)}} \le \cdots \le \frac{c_{\pi(L+1)}}{\beta_{\pi(L+1)}}.$

Since we are in the setting of the augmented hierarchy, there exists a distinguished
index $\ell^*$ corresponding to the perfect test, for which $c_{\pi(\ell^*)} = c^*$,
$\beta_{\pi(\ell^*)} = 1$.

Theorem 1. The optimal strategy for detecting a single target pattern
is to order the tests in accordance with $(\pi(1), \pi(2), \ldots, \pi(\ell^*))$, that is,
perform $X_{\pi(1)}$ first, then $X_{\pi(2)}$ whenever $X_{\pi(1)} = 1$, and so on, and stop
with $X_{\pi(\ell^*)}$. The tests $X_{\pi(k)}$ for $k > \ell^*$ are never performed.

Note that the last test, $X_{\pi(\ell^*)}$, is the perfect one, and always returns the
answer 0 under $P_0$. Reinterpreted in the original model, this would mean
that if $X_{\pi(\ell^* - 1)}$ is reached in the strategy and returns answer 1, then the
testing procedure ends and the postprocessing stage is performed.
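
The ordering rule can be checked by brute force on a small vine. In the sketch below the last test plays the role of the perfect test (power 1, cost $c^* = 4$); all numerical values are hypothetical. Enumerating all permutations of the full set of tests covers every admissible order, because the tests placed after the perfect test are reached with probability 0 and contribute nothing.

```python
from itertools import permutations


def vine_mean_cost(order, cost, power):
    """Mean cost under P0 of performing the tests in `order`, stopping at the first 0."""
    total, reach = 0.0, 1.0
    for k in order:
        total += reach * cost[k]
        reach *= 1.0 - power[k]
    return total


# Hypothetical costs and powers; index 3 is the perfect test of the augmented model.
cost = [1.0, 3.0, 2.0, 4.0]
power = [0.5, 0.9, 0.2, 1.0]

# Order the tests by increasing cost-to-power ratio, as in (11).
ratio_order = sorted(range(len(cost)), key=lambda k: cost[k] / power[k])
ratio_cost = vine_mean_cost(ratio_order, cost, power)

# Exhaustive search over all orders.
best_cost = min(vine_mean_cost(p, cost, power)
                for p in permutations(range(len(cost))))

print(ratio_order)  # [0, 1, 3, 2]
print(abs(ratio_cost - best_cost) < 1e-12)  # True
```

In this example the test with index 2 has the largest ratio and sits after the perfect test in the ordering, so it is never performed.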

This theorem is a consequence of a straightforward recursion (proof omitted)
applied to the following lemma.

Lemma 2. There exists an optimal strategy for which the first test performed
is $X_{\pi(1)}$.

Proof. Let $T$ be some strategy performing the tests in the order $\pi'(1), \pi'(2), \ldots, \pi'(k^*)$
[for some $k^* \le L + 1$, with $\pi'(k^*) = \pi(\ell^*) = L + 1$]. Assume $\pi'(1) \neq \pi(1)$ and
consider the strategy $T_0$ obtained by "switching" $X_{\pi(1)}$ to the first position,
that is, performing $X_{\pi(1)}$ first, and then whenever $X_{\pi(1)} = 1$ continuing
through strategy $T$ normally except if an index $i$ is encountered for which
$\pi'(i) = \pi(1)$, in which case $X_{\pi(1)}$ is not performed again, but just skipped.

Compare the costs of $T$ and $T_0$ using (10) summed over attributes: clearly
the mean cost of these strategies is a convex combination of the $(c_\ell / \beta_\ell)$,
$\ell = 1, \ldots, L + 1$, since $\sum_{\ell=1}^{L+1} P_0(A_\ell \in X_0(T)) = 1$ in the single-pattern case.
More explicitly,

$$P_0(A_k \in X_0(T)) = \beta_k \prod_{\ell :\ \pi'^{-1}(\ell) < \pi'^{-1}(k)} (1 - \beta_\ell),$$

where the product runs over the tests performed before $X_k$ in $T$,
with the corresponding formula for $T_0$. From this formula it is clear that
the weight for the ratio $c_{\pi(1)} / \beta_{\pi(1)}$ is higher in $T_0$ than in $T$, while all the
other weights either are smaller or stay the same (depending on whether the
corresponding tests were placed before or after $X_{\pi(1)}$ in $T$). Since $c_{\pi(1)} / \beta_{\pi(1)}$
is the smallest of the ratios, the average cost of $T_0$ is no higher than the cost
of $T$. □

5.3. Filtering all patterns. Our goal is to determine conditions under
which (4) is minimized by the CTF strategy. First, we consider a simple
sufficient condition which guarantees that the optimal strategy is complete,
meaning $T \in \bar{\mathcal{T}}$. [Recall that $T \in \bar{\mathcal{T}}$ if $\hat{Y}(T) = \hat{Y}(\mathcal{X})$; in other words, testing
is halted if and only if all "1-chains" in $\mathcal{X}$ are determined.] This condition is
by no means necessary, since we will prove the optimality of the CTF strategy
(which belongs to $\bar{\mathcal{T}}$) under a much weaker condition, but it is nevertheless
informative.

Proposition 1. If for any attribute $A \in \mathcal{A}$, $c(A)/\beta(A) < c^*$, then the optimal
strategy must belong to $\bar{\mathcal{T}}$.

Proof. Let $T$ be an optimal strategy and let $s$ denote a leaf of $T$.
Recall that $X(s)$ is the set of tests along the branch terminating in $s$ and
$\hat{Y}(s) = \mathcal{Y} \setminus \bigcup \{A : X_A \in X(s),\ X_A = 0\}$. The expected cost of $T$ is then of the
form

(12)    $E_0[C(T)] = C + p_s c^* |\hat{Y}(s)|,$

where $p_s$ is the probability of reaching $s$, the second term is the contribution
to the mean postprocessing cost at leaf $s$ and $C$ denotes the contributions
of other nodes to the average cost. In general $\hat{Y}(\mathcal{X}) \subset \hat{Y}(s)$, and if these sets
do not coincide, then by definition there must be a test $X_A \notin X(s)$ for which
$A \cap \hat{Y}(s) \neq \emptyset$. Consider the strategy $T'$ obtained by adding this test to $T$
at node $s$. Then

(13)    $E_0[C(T')] = C + p_s \left[ c(A) + \beta(A) c^* |\hat{Y}(s) \setminus A| + (1 - \beta(A)) c^* |\hat{Y}(s)| \right].$

Subtracting (13) from (12) gives
$E_0[C(T)] - E_0[C(T')] = p_s [\beta(A) c^* (|\hat{Y}(s)| - |\hat{Y}(s) \setminus A|) - c(A)]$.
Since $|\hat{Y}(s)| - |\hat{Y}(s) \setminus A| \ge 1$ it follows easily from the hypothesis
that $E_0[C(T)] - E_0[C(T')] > 0$, which contradicts the optimality of $T$. □

We now turn to the problem of optimality of CTF strategies. The method 
of proof used in Section 5.1, although very simple in that case, will still 
serve as a template for most of the results to come. More precisely, under 
different assumptions about the models, we will always try to first establish 
the following property, denoted (CF) for "coarsest first" : 

Definition 6 [(CF) property]. Test hierarchy X satisfies the (CF) prop- 
erty if there exists an optimal strategy for which the first test performed is 
the coarsest one. 

In most cases, we will establish the optimality of $T_{ctf}$ as a consequence
of (CF) for the various models considered. The current model (fixed, power-based
cost) is the simplest and allows us to present the main ideas behind
the arguments based on the (CF) property: a recursion based on "subhierarchies"
and the concept of a "conditional strategy." As always, $\mathcal{A}$ is a nested
hierarchy of attributes.

Definition 7 (Subhierarchy). We call $\mathcal{B} \subset \mathcal{A}$ a subhierarchy of $\mathcal{A}$ if
there exists an attribute $B_0 \in \mathcal{A}$ such that

$$\mathcal{B} = \{A \in \mathcal{A} \mid A \subseteq B_0\}.$$

More specifically, we call $\mathcal{B}$ the subhierarchy rooted in $B_0$ and we refer to $B_0$
as the set of patterns spanned by $\mathcal{B}$, also denoted $\mathcal{Y}_{\mathcal{B}}$.

Definition 8 (Conditional strategy). Let $A_1$ be the root of $\mathcal{A}$ and let
$\mathcal{B}$ be a subhierarchy of $\mathcal{A}$ rooted in one of the children of $A_1$. Then $\mathcal{A}$ can
be written as a disjoint union $\mathcal{A} = \{A_1\} \cup \mathcal{B} \cup \bar{\mathcal{B}}$, where $\bar{\mathcal{B}}$ collects the
remaining attributes. Let $x_{\bar{\mathcal{B}}}$ be a set of numbers
in $\{0, 1\}$ indexed by $\bar{\mathcal{B}}$. Consider a testing strategy $T$ for $\mathcal{A}$. The conditional
strategy $T_{\mathcal{B}}(x_{\bar{\mathcal{B}}})$ on subhierarchy $\mathcal{B}$ is defined as follows: For every internal
node $t$ of $T$:

(i) If $X(t)$ is a test for an attribute $B \in \mathcal{B}$, leave it unchanged.

(ii) If $X(t) = X_{A_1}$, cut the strategy subtree rooted at $t$ and replace it by
the right subtree of $t$.

(iii) If $X(t)$ is a test for an attribute $A \in \bar{\mathcal{B}}$, cut the strategy subtree
rooted at $t$ and replace it by the right subtree of $t$ if $x_A = 1$, and by the left
subtree of $t$ if $x_A = 0$.

Finally, relabel every remaining leaf $s$ by $\hat{Y}(s) \cap \mathcal{Y}_{\mathcal{B}}$.

This rather involved definition simply says that $T_{\mathcal{B}}(x_{\bar{\mathcal{B}}})$ is the testing
strategy on subhierarchy $\mathcal{B}$ obtained from $T$ when $X_{A_1} = 1$ and the answers
to $X_{\bar{\mathcal{B}}} = \{X_B, B \in \bar{\mathcal{B}}\}$ are fixed to be $x_{\bar{\mathcal{B}}}$, and $T$ is pruned accordingly. An
obvious but nevertheless crucial observation is that $T_{\mathcal{B}}(x_{\bar{\mathcal{B}}})$ is indeed a valid
testing strategy for the subset of attributes $\mathcal{B}$ and the corresponding subset
of patterns $\mathcal{Y}_{\mathcal{B}}$.

Theorem 2. If property (CF) holds for any subhierarchy $\mathcal{B}$ of $\mathcal{A}$ (including
$\mathcal{A}$ itself), then the CTF strategy is optimal.

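Proof. The proof is based on a simple recursion. Let $L$ be the depth of
$\mathcal{A}$. The case $L = 1$ is obvious from the (CF) property. Suppose the theorem
is valid for any $L < L_0$ with $L_0 \ge 2$. Now consider the case $L = L_0$.

Let $T$ be an optimal testing strategy. From the (CF) property, we can
assume that the test at the root of $T$ is $X_{A_1}$, where $A_1$ is the attribute at the root
of $\mathcal{A}$. Denote by $\mathcal{B}_1, \ldots, \mathcal{B}_k$ the subhierarchies rooted at the children of $A_1$,
which are of depth at most $L_0 - 1$. Since $\mathcal{A} = \{A_1\} \cup \mathcal{B}_1 \cup \cdots \cup \mathcal{B}_k$ (a disjoint
union), and $\mathcal{Y} = \mathcal{Y}_{\mathcal{B}_1} \cup \cdots \cup \mathcal{Y}_{\mathcal{B}_k}$, we can partition the cost of $T$ as follows:

(14)    $E_0[C(T)] = \sum_{A \in \mathcal{A}} q_A(T) c(A) + E_0[c^* |\hat{Y}(T)|]$
        $= q_{A_1}(T) c(A_1) + \sum_{A \in \mathcal{B}_1} q_A(T) c(A) + E_0[c^* |\hat{Y}(T) \cap \mathcal{Y}_{\mathcal{B}_1}|]$
        $\quad + \cdots + \sum_{A \in \mathcal{B}_k} q_A(T) c(A) + E_0[c^* |\hat{Y}(T) \cap \mathcal{Y}_{\mathcal{B}_k}|].$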

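Let us focus on the first sum. Let $\Omega(x_{\bar{\mathcal{B}}_1})$ be the event $\{X_{\bar{\mathcal{B}}_1} = x_{\bar{\mathcal{B}}_1},\ X_{A_1} = 1\}$.
Consider the conditional strategy $T^{(1)} = T_{\mathcal{B}_1}(x_{\bar{\mathcal{B}}_1})$ and let $q_A(T^{(1)}; x_{\bar{\mathcal{B}}_1})$
be the probability under $P_0(\cdot \mid \Omega(x_{\bar{\mathcal{B}}_1}))$ of performing the test for $A \in \mathcal{B}_1$ using
$T^{(1)}$. The tests $\{X_A, A \in \mathcal{B}_1\}$ are conditionally independent given $\Omega(x_{\bar{\mathcal{B}}_1})$,
with powers $\{\beta_A, A \in \mathcal{B}_1\}$. By the recurrence hypothesis, we can apply the
theorem to subhierarchy $\mathcal{B}_1$ for the above conditional probability and conclude
that the cost of strategy $T^{(1)}$ satisfies [for any $(x_{\bar{\mathcal{B}}_1})$]

(15)    $E_0[C(T^{(1)}) \mid \Omega(x_{\bar{\mathcal{B}}_1})] = \sum_{A \in \mathcal{B}_1} c(A) q_A(T^{(1)}; x_{\bar{\mathcal{B}}_1}) + E_0[c^* |\hat{Y}(T^{(1)})| \mid \Omega(x_{\bar{\mathcal{B}}_1})]$
        $\ge E_0[C(T^{(1)}_{ctf}) \mid \Omega(x_{\bar{\mathcal{B}}_1})] = E_0[C(T^{(1)}_{ctf})],$

where $T^{(1)}_{ctf}$ is the CTF strategy for hierarchy $\mathcal{B}_1$ [whose cost is independent
of $\Omega(x_{\bar{\mathcal{B}}_1})$]. We now want to take the expectation of (15) conditional
on $\{X_{A_1} = 1\}$ only; by independence of the tests this amounts to taking the
expectation of (15) with respect to $(X_{\bar{\mathcal{B}}_1})$. Now, by construction of the conditional
strategy, denoting by $\beta_1$ the power of test $X_{A_1}$, for all $A \in \mathcal{B}_1$ we
have

$$E_0[q_A(T^{(1)}; X_{\bar{\mathcal{B}}_1}) \mid X_{A_1} = 1] = P_0[X_A \text{ performed by } T \mid X_{A_1} = 1] = q_A(T)(1 - \beta_1)^{-1},$$

where the last equality holds because $X_{A_1}$ is the first test to be performed
in $T$. Similarly, on the event $\{X_{A_1} = 1\}$ we have $\hat{Y}(T^{(1)}) = \hat{Y}(T) \cap \mathcal{Y}_{\mathcal{B}_1}$, and
therefore

$$E_0[c^* |\hat{Y}(T^{(1)})| \mid X_{A_1} = 1] = E_0[c^* |\hat{Y}(T) \cap \mathcal{Y}_{\mathcal{B}_1}| \mid X_{A_1} = 1] = E_0[c^* |\hat{Y}(T) \cap \mathcal{Y}_{\mathcal{B}_1}|](1 - \beta_1)^{-1},$$

[since $X_{A_1} = 0 \Rightarrow \hat{Y}(T) = \emptyset$]. Therefore, taking expectations w.r.t. $(X_{\bar{\mathcal{B}}_1})$
in (15) we obtain

$$E_0[C(T_{\mathcal{B}_1}(X_{\bar{\mathcal{B}}_1}))] = (1 - \beta_1)^{-1} \left( \sum_{A \in \mathcal{B}_1} c(A) q_A(T) + E_0[c^* |\hat{Y}(T) \cap \mathcal{Y}_{\mathcal{B}_1}|] \right) \ge E_0[C(T^{(1)}_{ctf})].$$

Applying the same reasoning to the other terms of (14), we now obtain

$$E_0[C(T)] \ge c(A_1) q_{A_1}(T) + (1 - \beta_1) E_0[C(T^{(1)}_{ctf}) + \cdots + C(T^{(k)}_{ctf})].$$

Finally, the right-hand side is precisely the total cost of the CTF strategy
for $\mathcal{A}$. Therefore the CTF strategy is optimal. □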

We now give a sufficient condition ensuring the (CF) property. 

Theorem 3. Let $A_1$ be the coarsest attribute. Then the (CF) property holds
under the condition

$$\frac{c(A_1)}{\beta(A_1)} \le \inf_{Z \in Z(\bar{\mathcal{A}})} \sum_{A \in Z} \frac{c(A)}{\beta(A)}.$$

Corollary 1. Consider the augmented hierarchy $\bar{\mathcal{A}}$ as a tree structure
(the original hierarchy $\mathcal{A}$ can then be seen as the set of internal nodes of $\bar{\mathcal{A}}$).
For any $A \in \mathcal{A}$, let $C(A)$ be the set of direct children of $A$ in $\bar{\mathcal{A}}$. Then the
CTF strategy is optimal if the following condition is satisfied:

(16)    $\forall A \in \mathcal{A}, \quad \frac{c(A)}{\beta(A)} \le \sum_{B \in C(A)} \frac{c(B)}{\beta(B)}.$

Proof of Theorem 3. For this proof, it is easier to work with the
"augmented" model put forward in Section 4.5.4. Let $T$ be a testing strategy
for $\bar{\mathcal{A}}$ such that the first attribute to be tested is not the coarsest attribute
$A_1$. From $T$ construct the strategy $T_0$ by "switching" test $X_{A_1}$ to
the root, that is, perform $X_{A_1}$ first, and when the result is 1, proceed normally
through strategy $T$, except when test $X_{A_1}$ is encountered in $T$, in
which case it is not performed again and one jumps directly to its right
child (corresponding to $X_{A_1} = 1$ in the original $T$).

Now compare the mean costs of $T$ and $T_0$ using (7). Similarly to the proof
of Lemma 2, we will prove that in the convex combination defining the cost
in (7), the weight of the term $c(A_1)/\beta(A_1)$ is higher in $T_0$ than in $T$, while
the weights of all the other terms of the form $\left(\sum_{A \in Z} c(A)/\beta(A)\right)$ are smaller
or stay unchanged for all other coverings $Z \in Z(\bar{\mathcal{A}})$. This together with the
hypothesis of the theorem establishes property (CF).

To verify the above statements about the weights of the different coverings,
first call the "covering support" $CS(T)$ of a strategy $T$ the set of
coverings $Z \in Z(\bar{\mathcal{A}})$ such that $P_0(X_0(T) = Z) \neq 0$. It is clear from the construction
of $T_0$ that $CS(T_0) \subset CS(T) \cup \{\{A_1\}\}$. Therefore we can restrict the
analysis to the coverings in $Z_0 = CS(T) \cup \{\{A_1\}\}$.

Note that $CS(T)$ is in one-to-one correspondence with the set of leaves of $T$
having nonzero probability of being reached; for any $Z \in CS(T)$, $P_0(X_0(T) = Z)$
is precisely the probability of reaching the leaf $s_T(Z)$ of $T$ associated with the
covering $Z$. Along the branch leading to this leaf one finds all the events
$\{X_A = 0\}$ for $A \in Z$, along with a number of other events $\{X_A = 1\}$ for $A$
in a certain set $X_1(s_T(Z))$. Therefore this probability is of the form

$$P_0(X_0(T) = Z) = P_0(s_T(Z) \text{ is reached}) = \prod_{A \in Z} \beta(A) \prod_{A' \in X_1(s_T(Z))} (1 - \beta(A')).$$

Now with this formula in mind, any $Z \in Z_0$ falls into one of the following
cases:

1. $Z = \{A_1\}$, in which case obviously $P_0(X_0(T_0) = Z) \ge P_0(X_0(T) = Z)$;

2. $A_1 \in Z$ but $Z \neq \{A_1\}$, in which case $P_0(X_0(T_0) = Z) = 0$;

3. $A_1 \notin Z$ and $A_1 \notin X_1(s_T(Z))$, in which case $P_0(X_0(T_0) = Z) = (1 - \beta_1) \times P_0(X_0(T) = Z)$;

4. $A_1 \notin Z$ and $A_1 \in X_1(s_T(Z))$, in which case $P_0(X_0(T_0) = Z) = P_0(X_0(T) = Z)$.

Together, these different cases prove the desired property: $\{A_1\}$ is the only
covering having higher weight in the cost of $T_0$ than in the cost of $T$. □

Corollary 1 follows immediately: Its hypothesis clearly implies that the
hypothesis of Theorem 3 is satisfied for any subhierarchy of $\mathcal{A}$ and the
conclusion then follows from Theorem 2.

Note that, in contrast to what happened in the case of single-target detection,
condition (16) falls short of being a necessary condition for ensuring the
optimality of CTF strategies. To obtain a counterexample, consider the case
of a depth-2 hierarchy with a coarsest attribute $A_1$ and two children $B_1, B_2$,
and suppose that $c^*$ is large enough so that the condition of Proposition 1
is satisfied, so that we may restrict our attention to complete strategies.
Then one can show (by explicitly listing all possible strategies) that the
CTF strategy is optimal iff

$\frac{c(A_1)}{\beta(A_1)} \le \min\left( \frac{c(B_1)}{\beta(B_1)\beta(B_2)} + \frac{c(B_2)}{\beta(B_2)},\ \frac{c(B_1)}{\beta(B_1)} + \frac{c(B_2)}{\beta(B_1)\beta(B_2)} \right).$

Clearly this condition is weaker than (16).
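The depth-2 counterexample can be explored numerically. The following is a minimal sketch, not the enumeration used by the authors: it assumes that a "complete" strategy keeps testing as long as some untested attribute still contains a pattern that has not been ruled out, and it computes the least expected cost over such strategies by dynamic programming on the observed outcomes. The names `optimal_complete_cost`, `ctf_cost` and `ctf_condition`, and the numerical costs and powers, are illustrative assumptions.

```python
from functools import lru_cache


def optimal_complete_cost(cost, power, members):
    """Least expected cost (under P_0, independent tests) over complete strategies."""
    tests = tuple(sorted(cost))
    patterns = frozenset().union(*members.values())

    @lru_cache(maxsize=None)
    def value(done):
        # `done` is a frozenset of (test, outcome) pairs observed so far.
        tested = {t for t, _ in done}
        ruled_out = set()
        for t, outcome in done:
            if outcome == 0:
                ruled_out |= members[t]
        alive = patterns - ruled_out
        # Assumption: a test is useful if untested and covering a surviving pattern.
        useful = [t for t in tests if t not in tested and members[t] & alive]
        if not useful:
            return 0.0
        return min(
            cost[t]
            + power[t] * value(done | {(t, 0)})
            + (1.0 - power[t]) * value(done | {(t, 1)})
            for t in useful
        )

    return value(frozenset())


def ctf_cost(cost, power):
    # Test A1 first; if it answers 1, both children must still be tested.
    return cost["A1"] + (1.0 - power["A1"]) * (cost["B1"] + cost["B2"])


def ctf_condition(cost, power):
    # The "iff" condition displayed above.
    b1, b2 = power["B1"], power["B2"]
    rhs = min(cost["B1"] / (b1 * b2) + cost["B2"] / b2,
              cost["B1"] / b1 + cost["B2"] / (b1 * b2))
    return cost["A1"] / power["A1"] <= rhs


MEMBERS = {"A1": frozenset({1, 2}), "B1": frozenset({1}), "B2": frozenset({2})}
POWER = {"A1": 0.5, "B1": 0.5, "B2": 0.5}

for c_a in (1.0, 2.5, 5.0):
    cost = {"A1": c_a, "B1": 1.0, "B2": 1.0}
    best = optimal_complete_cost(cost, POWER, MEMBERS)
    is_optimal = abs(ctf_cost(cost, POWER) - best) < 1e-12
    print(c_a, is_optimal, ctf_condition(cost, POWER))
```

Under these assumptions the two printed flags agree for each value of `c_a`: CTF attains the dynamic-programming optimum exactly when the displayed inequality holds.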

Application to the power-based cost model. We can now look at the consequences
of these results if we assume the cost model given by (5), in which
case the following corollary is straightforward.

Corollary 2. Assume the cost of the attribute tests obeys the model
given by (5), with $\Gamma$ subadditive and $\Psi(x)/x$ increasing. Then the CTF strategy
is optimal for any hierarchy $\mathcal{A}$ for which $\beta(A) \le \beta(B)$ whenever $B \subset A$.
In that case, the optimal strategy is CTF in both resolution and power.

Similarly, in the case of detecting a single pattern of interest, if we assume
$\Gamma = 1$, the CTF strategy is optimal when $\Psi(x)/x$ is increasing, a result that
was already proved in [13].

5.4. Simulations with an elementary dependency model. We also performed
limited simulations in the case where the tests are not independent
under $P_0$ but obey a very simple Markov dependency structure. Suppose
the power of the coarsest test is $\beta_1$; the powers of subsequent tests follow
a first-order Markov model depending on their direct ancestor. More precisely,
the probability that a test returns 0 is $\gamma$ (resp. $\lambda$) given that its father
returned 0 (resp. 1), with $\gamma > \lambda$. The cost model used is the multiplicative
cost model for $c(X_{A,\beta})$ given in (5), with $\beta$ the average power of the test.

We performed experiments for a set of four patterns and a corresponding
depth-3 dyadic hierarchy, comparing the cost of the CTF strategy to the best
cost among a set of 5000 randomly sampled strategies. In our experience,
because the problem is small, strategies that are better than the CTF one,
when they exist, are usually detected in the simulation.

What we found was that, for a given value of $\gamma$ and $\lambda$, the CTF strategy is
generally optimal when $\beta_1 < \lambda$ (for various choices of the power function $\Psi$).
However, when $\beta_1$ becomes too large, the CTF strategy is no longer
optimal. Heuristically, this is because the coarse questions are then more
powerful but also much too costly. The limiting value of $\beta_1$ for which CTF
is optimal does not appear to be equal to the value $\beta^* = \lambda/(1 + \lambda - \gamma)$, the
invariant probability for the Markov model. In particular, there are cases
where $\lambda < \beta_1 < \beta^*$ (meaning that the average powers are increasing with
depth) and yet CTF is not optimal.
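The relation between $\beta_1$, $\lambda$, $\gamma$ and the invariant probability can be made concrete with a few lines of code. This is only a sketch of the marginal (average) powers implied by the Markov model described above; the function names and the numerical values of $\gamma$, $\lambda$ and $\beta_1$ are illustrative assumptions, not values taken from the simulations.

```python
def average_powers(beta1, gamma, lam, depth):
    """Average power at levels 1..depth: beta_{l+1} = gamma*beta_l + lam*(1 - beta_l)."""
    powers = [beta1]
    for _ in range(depth - 1):
        b = powers[-1]
        powers.append(gamma * b + lam * (1.0 - b))
    return powers


def invariant_power(gamma, lam):
    # Fixed point of the recursion above: beta* = lam / (1 + lam - gamma).
    return lam / (1.0 + lam - gamma)


gamma, lam = 0.8, 0.4            # hypothetical values with gamma > lam
beta_star = invariant_power(gamma, lam)
powers = average_powers(0.5, gamma, lam, depth=3)
print(powers, beta_star)
```

With $\lambda < \beta_1 < \beta^*$, as in this example, the average powers increase with depth toward $\beta^*$, which is the regime in which the text reports that CTF may nevertheless fail to be optimal.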

To conclude, these very limited simulations suggest that the optimality of
CTF strategies can be expected to persist over a fairly wide range of models,
even though the optimization problem is already somewhat complex with a
simple dependency structure and leads to challenging questions.

6. Optimal strategies for power-based cost and variable powers. 

6.1. Model and motivations. In this section we only consider searching
for all possible patterns. The previous section dealt with a fixed hierarchy,
that is, a single test $X_A$ at a given power $\beta(A)$ for each $A \in \mathcal{A}$. Now suppose we
can have, for each $A \in \mathcal{A}$, tests of varying power; of course, a more powerful
test at the same level of invariance will be more expensive. (In Section 8 we
illustrate this trade-off for a particular data-driven construction.) In fact, for
each attribute $A \in \mathcal{A}$, we suppose there is a test for every possible power,
whose cost is determined as follows:

Cost model. Let $\Psi : [0,1] \to [0,1]$ be convex and strictly increasing with
$\Psi(0) = 0$ and $\Psi(1) = 1$, and let $\Gamma : \mathbb{N}^* \to \mathbb{R}_+$ be subadditive with $\Gamma(1) = 1$.
We suppose

(17)    $c(X_{A,\beta}) = c(A, \beta) = c \times \Gamma(|A|) \times \Psi(\beta).$

Recall that the total cost of a strategy $T$ is given by

$C_{\mathrm{test}}(T) + c^* |\hat{Y}(T)|.$

The constant $c$ in (17) represents the cost of a $P_0$-perfect test for a single
pattern and the constant $c^*$ represents the cost per pattern of disambiguating
among the patterns remaining after detection. Evidently, only the ratio $c/c^*$
matters. We are going to assume that $c^* = c = \Psi(1) = 1$; note that this choice
coheres with the formal interpretation of postprocessing cost as the cost of
"errorless testing" put forward in Section 4.5.4.

For the rest of this section we will implicitly adopt this point of view,
that is, replacing effective postprocessing cost by formal perfect tests
corresponding to an additional layer of formal attributes copying the original
leaves (this formal doubling of the leaf attributes allows us to keep untouched
the rule that no attribute can be tested twice). For these special tests only,
the power cannot be chosen arbitrarily and is fixed to 1; and the strategies
considered must make no errors, which is enforced by performing some of
these perfect tests at the end of the search if needed.

We are going to focus primarily on the case $\Gamma(k) = k$. Consequently,

(18)    $c(X_{A \cup B, \beta}) = c(X_{A,\beta}) + c(X_{B,\beta})$  when $A \cap B = \emptyset$.

This is, in effect, the choice of $\Gamma$ least favorable to CTF strategies since there
is no savings in cost due to shared properties among disjoint attributes. For
instance, in practice it should not be twice as costly to build a test at power
$\beta$ for the explanation $\{E, F\}$ as for $\{E\}$ or $\{F\}$ separately at power $\beta$,
since (upon registration) these shapes share many "features" (e.g., edges;
see Section 8). Nonetheless, with this choice of $\Gamma$ the convexity assumption
for $\Psi$ can now be justified as follows:

Motivation for convexity. As usual, two tests for disjoint attributes are
independent under $P_0$. Consider the following situation: For $A$ and $B$ disjoint,
first test $A$ with power $\beta_1$ and stop if the answer is positive ($X_A = 1$);
otherwise, test $B$ with power $\beta_2$ and stop. This produces a randomized,
composite test for $A \cup B$ with power $\beta_1\beta_2$ and (mean) cost

$|A|\Psi(\beta_1) + \beta_1 |B| \Psi(\beta_2).$

Contrast this with directly testing $A \cup B$ with power $\beta_1\beta_2$, which should not
have greater cost than the composite test since, presumably, we have already
selected the "best" tests at any given power and invariance; see Section 8
for an illustration. Under our cost model, this implies

(19)    $(|A| + |B|)\Psi(\beta_1\beta_2) \le |A|\Psi(\beta_1) + \beta_1 |B| \Psi(\beta_2).$

Demanding (19) for any two attributes implies (by letting $|A|/|B| \to 0$) that
we should have

(20)    $\Psi(\beta_1\beta_2) \le \beta_1 \Psi(\beta_2).$

[Conversely, it is easy to see that if (20) is satisfied, then (19) holds for
any $|A|, |B|$.] Since we want (20) to hold for any $\beta_1, \beta_2 \in [0,1]$ we see (after
dividing by $\beta_1\beta_2$) that (20) implies that $\Psi(x)/x$ is an increasing function.
In our model we make the stronger hypothesis that $\Psi$ is convex in order to
simplify the analysis.
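Inequality (19) is easy to check numerically for a specific convex $\Psi$. The sketch below uses, as an assumption made only for illustration, the function $\Psi(x) = 2 - 2\sqrt{1-x} - x$ (the "harmonic" cost function that the paper introduces later as (26)), together with $\Gamma(k) = k$; the helper names and the grid are hypothetical.

```python
import math


def psi(beta):
    # One convex choice with psi(0) = 0 and psi(1) = 1 (function (26) of the text).
    return 2.0 - 2.0 * math.sqrt(1.0 - beta) - beta


def composite_cost(size_a, size_b, beta1, beta2):
    # Test A at power beta1, then B at power beta2 only if X_A = 0 (probability beta1).
    return size_a * psi(beta1) + beta1 * size_b * psi(beta2)


def direct_cost(size_a, size_b, beta1, beta2):
    # Single test for the union at power beta1 * beta2, with Gamma(k) = k.
    return (size_a + size_b) * psi(beta1 * beta2)


grid = [i / 20 for i in range(21)]
worst = max(
    direct_cost(sa, sb, b1, b2) - composite_cost(sa, sb, b1, b2)
    for sa in (1, 2, 8) for sb in (1, 2, 8)
    for b1 in grid for b2 in grid
)
print(worst)   # never positive (up to rounding) if (19) holds on the grid
```

For this choice of $\Psi$ the direct test is never more expensive than the composite one on the sampled grid, in line with (19) and (20).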

Remark on independence. It would be unrealistic to assume the independence
of all the tests in the variable-power hierarchy $\mathcal{X}$, rather than only
for families corresponding to different attributes. In fact, in practice there
is a limit to the number of independent (or even weakly dependent) tests
that can be made for a fixed attribute. Were there not, then near-perfect
detection would be possible in the sense of obtaining arbitrarily low cost
and error by performing enough cheap tests of high invariance, at least in
the case in which $\Psi'(0) = 0$.

Example. Let $A = \mathcal{Y}$, the coarsest attribute, and suppose $\{X_{A,\beta_j}, j = 1, 2, \ldots\}$
are independent with $\beta_j \searrow \delta$. Consider the vine-structured testing
strategy $T_n$ which successively executes $X_{A,\beta_j}$, $j = 1, 2, \ldots, n$, stopping [with
label $\hat{Y}(T_n) = \emptyset$] as soon as a null response is found and otherwise yielding
$\hat{Y}(T_n) = \mathcal{Y}$. Then it is easy to show (see [6]) that $P_0(\hat{Y}(T_n) = \mathcal{Y}) \le (1 - \delta)^n$
and that $E_0[C_{\mathrm{test}}(T_n)] \le |\mathcal{Y}| \Psi(\beta_1)/\delta$. Since $\Psi(\delta)/\delta \to 0$ as $\delta \to 0$, given $\varepsilon > 0$, we can
choose $n$, $\delta$ and $\beta_1$ close enough to $\delta$ such that $P_0(\hat{Y}(T_n) \ne \emptyset) \le \varepsilon$ and
$E_0[C(T_n)] = E_0[C_{\mathrm{test}}(T_n)] + E_0[|\hat{Y}(T_n)|] \le \varepsilon$.

6.2. Basic results. In the sequel, $\Psi^*$ will denote the Legendre transform
of $\Psi$,

$\Psi^*(x) = \sup_{\beta \in [0,1]} (x\beta - \Psi(\beta)).$

In addition, for any $a > 0$, define

$\Psi_a^*(x) = a \Psi^*(x/a),$

$\Phi_a(x) = x - \Psi_a^*(x).$

6.2.1. Optimal power selection. Consider partially specifying a strategy 
T by fixing the attribute A to be tested at each (internal) node but not 
the power. What assignment of powers (to the nonperfect tests) minimizes
the average cost of T? As with dynamic programming, it is easily seen 
that the answer is given as follows: Start by optimizing the powers of the 
last, nonperfect tests performed along each branch (since the left and right 
subtrees of such a node have fixed, known cost), and then climb recursively 
up each branch of the tree, optimizing the power of the parent at each step. 
The actual optimization at each step is a simple calculation, summed up by 
the following lemma: 

Lemma 3. Consider a (sub)strategy $T$ consisting of a test $X_{A,\beta}$ at the
root, a left subtree $T_L$ of average cost $x$ and a right subtree $T_R$ of average
cost $y$. Let $\Gamma(|A|) = a$. Then under the cost model (17) the average cost of
$T$ using the optimal choice of $\beta$ is given by

(21)    $E_0[C(T)] = y - \Psi_a^*(y - x) = x + \Phi_a(y - x).$

In particular, if $T_L$ is empty, then $x = 0$ and $E_0[C(T)] = \Phi_a(y)$. If $\Psi$ is
differentiable, the optimal choice of $\beta$ is

$\beta^* = \begin{cases}
(\Psi')^{-1}((y - x)/a), & \text{if } (y - x)/a \in \Psi'([0,1]), \\
0, & \text{if } (y - x)/a < \Psi'(0), \\
1, & \text{if } (y - x)/a > \Psi'(1).
\end{cases}$

More generally, $\Psi$ admits $(y - x)/a$ as a subgradient at point $\beta^*$.

Proof. Let $T(\beta)$ denote the strategy using power $\beta$, and calculate the
average cost of $T(\beta)$ as a function of $\beta, x, y, a$:

$E_0[C(T(\beta))] = c(X_{A,\beta}) + \beta x + (1 - \beta) y = a\Psi(\beta) + \beta x + (1 - \beta) y.$

Now minimizing over $\beta$ leads directly to (21) and the formulae for $\beta^*$, using
the definitions of $\Psi^*$ and $\Phi_a$. □
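Lemma 3 can be checked numerically for a concrete power function. The sketch below assumes the harmonic function $\Psi(x) = 2 - 2\sqrt{1-x} - x$ introduced later as (26), for which $\Psi'(\beta) = 1/\sqrt{1-\beta} - 1$ and $\Phi_a(z) = az/(a+z)$; the values of `a`, `x`, `y` are arbitrary illustrative choices, and the brute-force grid search is only a sanity check, not part of the paper's argument.

```python
import math


def psi(beta):
    return 2.0 - 2.0 * math.sqrt(1.0 - beta) - beta


def expected_cost(beta, a, x, y):
    # a * psi(beta) is the test cost (17) with c = 1; the left subtree (cost x)
    # is entered with probability beta, the right one (cost y) otherwise.
    return a * psi(beta) + beta * x + (1.0 - beta) * y


def optimal_power(a, x, y):
    # Solves psi'(beta) = (y - x)/a, using psi'(beta) = 1/sqrt(1 - beta) - 1.
    slope = (y - x) / a
    return 1.0 - 1.0 / (1.0 + slope) ** 2


def phi(a, z):
    # Closed form of Phi_a for the harmonic psi (valid for z >= 0).
    return a * z / (a + z)


a, x, y = 4.0, 0.5, 2.0
beta_star = optimal_power(a, x, y)
closed_form = x + phi(a, y - x)                      # right-hand side of (21)
grid_min = min(expected_cost(i / 10000, a, x, y) for i in range(10001))
print(beta_star, closed_form, grid_min)
```

The cost at `beta_star` coincides with the closed form (21), and the grid minimum agrees with it up to the grid resolution.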

6.2.2. Properties of the CTF strategy. In previous sections, with fixed 
powers, all variations on CTF exploration (e.g., depth-first and breadth- 
first) had the same average cost, and hence we spoke of "the" CTF strategy. 
With variable powers the situation might appear different: The bottom-up 
optimization process in Section 6.2.1 for assigning the powers may lead to 
different mean costs for different CTF strategies. More specifically, recall
that $A(s), \beta(s)$ denote the attribute and power assigned to an internal node
$s$ in a tree $T$. For CTF trees, it may be that $\beta^*(s)$, the optimal power at $s$,
depends on the position of $s$ within $T$ as well as $A(s)$.

The following theorem states that, in fact, as in the fixed-powers case, 
among CTF strategies, the order of testing is irrelevant when the powers are 
optimally chosen. More precisely, the optimal power of a test depends only on 
the attribute being tested, specifically on the structure of the subhierarchy 
rooted at the attribute. Consequently, in CTF strategies a given attribute 
will always be tested at the same power, which means that CTF designs can 
be implemented by constructing only one test per attribute — a considerable 
practical advantage. 

Theorem 4. For any CTF strategy $T$, and for any two nodes $s, t$ in $T$
with $A(s) = A(t)$, the optimal choices of powers are identical: $\beta^*(s) = \beta^*(t)$.
In fact, the unique power assigned to an attribute $A \in \mathcal{A}$ depends only on
the structure of the subhierarchy $\mathcal{B}(A)$ rooted in $A$. As a consequence, all CTF
strategies have the same average cost.

Whereas the principle of the proof is simple (a recursion on the size of A), 
it does require some auxiliary notation, and hence we postpone it to the 
Appendix. 

Turning to the cost of the CTF strategy, it can easily be computed recursively
for regular attribute hierarchies and the simple complexity function
$\Gamma(k) = k$. More precisely, we have the following theorem for dyadic hierarchies,
in which $\beta_l^*(L)$ denotes the optimal power for the $2^{l-1}$ attributes at
level $l = 1, \ldots, L$ for a hierarchy of total depth $L$.

Theorem 5. Let $C_L$ denote the average CTF cost of a regular, complete
dyadic hierarchy of depth $L$. Then

(22)    $C_{L+1} = \Phi_{2^L}(2C_L)$

with (formally) $C_0 = \Psi(1)/2$. Furthermore,

$C_L / 2^{L-1} \searrow \Psi'(0), \qquad L \to \infty,$

and

(23)    $\beta_1^*(L) \searrow 0, \qquad L \to \infty.$

Finally,

(24)    $\beta_l^*(L) = \beta_1^*(L - l + 1), \qquad l = 1, \ldots, L,$

from which it follows that the CTF strategy is CTF in power, that is, power
increases with depth.



Proof. Consider a (complete, dyadic) hierarchy of depth $L + 1$. The
coarsest attribute has cardinality $|A_1| = 2^L$ and the (optimized, breadth-first)
CTF strategy starts with the corresponding test. If $X_{A_1} = 0$, the search
is over; if not, it is necessary to pay the mean cost for the two subhierarchies
of depth $L$. We thus apply (21) with $x = 0$, $y = 2C_L$ to obtain (22). When
$L = 1$ (one pattern), it is easy to check that we retrieve the right value of $C_1$
from (22) with $C_0 = \Psi(1)/2$ by noting that, in this case, $y = \Psi(1)$, which is
the cost of a perfect test.

Let $U_L = C_L / 2^{L-1}$. Then (22) can be rewritten as

$U_{L+1} = \Phi_1(U_L),$

which allows us to study the asymptotic behavior of $U_L$ when $L$ is large
based on the function $\Phi_1(x) = x - \Psi^*(x)$. Since $\Psi$ is convex, it follows
that $\Psi(\beta)/\beta$ is increasing, and hence $x\beta - \Psi(\beta) \le 0$ for all $0 \le \beta \le 1$ (with
equality at $\beta = 0$) whenever $0 \le x \le \Psi'(0)$. Consequently, $\Psi^*(x) = 0$ and
$\Phi_1(x) = x$ for $x \in [0, \Psi'(0)]$. Similarly, $\Phi_1(x) < x$ for $x > \Psi'(0)$. We have
$U_0 = \Psi(1) \ge \Psi'(0)$ because $\Psi$ is convex, and hence since $\Phi_1$ is concave,
$U_L \searrow \Psi'(0)$ as $L \to \infty$. Finally, from Lemma 3 we can also conclude that
$\beta_1^*(L) = (\Psi')^{-1}(U_L \wedge \Psi'(1))$. The last assertion (24) of the theorem follows
directly from Theorem 4. □
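Recursion (22) is straightforward to iterate. As an illustration only, the sketch below specializes it to the harmonic power function (26), for which $\Phi_a(z) = az/(a+z)$; in that case the normalized recursion reads $U_{L+1} = U_L/(1+U_L)$ with $U_0 = 1$, so that $U_L = 1/(L+1)$ and $C_L = 2^{L-1}/(L+1)$. This closed form is a direct consequence of the recursion under that assumption, not a formula stated in the text, and the function names are hypothetical.

```python
def phi(a, z):
    # Phi_a for the harmonic power function (assumption of this sketch).
    return a * z / (a + z)


def ctf_costs(max_depth):
    """Return [C_0, C_1, ..., C_max_depth] from recursion (22)."""
    costs = [0.5]                                   # C_0 = Psi(1)/2 with Psi(1) = 1
    for L in range(max_depth):
        costs.append(phi(2.0 ** L, 2.0 * costs[-1]))
    return costs


costs = ctf_costs(10)
normalized = [c / 2.0 ** (L - 1) for L, c in enumerate(costs)]   # U_L = C_L / 2^(L-1)
for L, (c, u) in enumerate(zip(costs, normalized)):
    print(L, c, u)
```

The normalized costs decrease to $\Psi'(0) = 0$, so the CTF cost grows more slowly than the cost $2^{L-1}$ of performing only the perfect tests.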

Remark 1. We deduce from the above results that if $\Psi'(0) = \delta > 0$, we
have $C_L \sim \delta 2^{L-1}$. If, on the other hand, $\Psi'(0) = 0$, then $C_L = o(2^{L-1})$. This
should be compared to the strategy of performing only (all) the perfect tests,
which costs $2^{L-1}$.

Remark 2. Since the optimal powers are increasing with depth, if we
now consider them as fixed we are in the framework of Corollary 2, which ensures
that, for these choices of powers, the CTF strategy is indeed optimal.

Remark 3. Note that the cost of individual tests (with optimal powers)
may not vary monotonically with their depth; however, the cumulated cost
of all tests at a given depth increases with depth.

6.3. Is the CTF strategy optimal? We have not been able to prove the
optimality of the CTF strategy under general conditions on $\Psi$, but rather
only for one specific example. This is disappointing because the simulations
presented later in this section strongly indicate a more general phenomenon.

If we try to follow our usual method for proving optimality, it turns out
that the most difficult step is actually to prove the (CF) property. Under
the (CF) property, the optimality of CTF would readily follow: it suffices
to follow the lines of the proof of Theorem 2 with minor adaptations, mainly
replacing families $(X_A)_{A \in \mathcal{B}}$ by $(X_{A,\beta})_{A \in \mathcal{B}, \beta \in [0,1]}$.

One way to prove the (CF) property is to proceed iteratively, repeatedly 
applying the "switching property": 

Definition 9 (Switching property). A power function $\Psi$ has the switching
property if any (sub)tree $T$ of the form shown on the left-hand side of
Figure 4, with any powers, has a larger mean cost than the tree obtained by
switching the two first tests of $T$ (shown on the right-hand side of Figure 4),
with optimal powers. Using Lemma 3, this inequality may be expressed as
follows: $\forall y \ge x \ge 0$, $\forall a \ge b > 0$,

(25)    $\Phi_a(x + \Phi_b(y - x)) \le \Phi_a(x) + \Phi_b(\Phi_a(y) - \Phi_a(x)).$

Fig. 4. The context of the switching property. Attribute $A_1$ is the coarsest attribute in
the hierarchy; hence $\Gamma(|B|) = b \le \Gamma(|A_1|) = a$.

We then have the following lemma. 

Lemma 4. The (CF) property is implied by the switching property. 

Proof. Note first that we can assume that $X_{A_1}$, the coarsest test, is
performed at some point (at some power) along every branch of any $T$. If this
is not the case, it can simply be added, with zero power, at the end of any
branch where it does not appear without changing the cost. Now let $T$ be a
strategy such that $X_{A_1}$ is not performed first. Apply the switching lemma
to any subtree of $T$ of the form shown on the left-hand side of Figure 4. In
this way, $X_{A_1}$ is pushed up in the tree while reducing the cost. This can be
done repeatedly until no such subtree exists, that is, the situation depicted
in Figure 4 does not occur anywhere in $T$. But then the resulting tree must
have $X_{A_1}$ at the root. Otherwise, let $k$ be the maximum depth in $T$ where
$X_{A_1}$ appears, and let $s$ be the corresponding node. Let $s'$ be the direct
sibling of $s$, which exists since $k > 1$. Consider a branch $b$ containing $s'$.
Since $X_{A_1}$ is performed along any branch, it must be performed somewhere
in $b$, say at node $t$. But $t$ cannot be an ancestor of $s'$, since otherwise $X_{A_1}$
would be performed twice along branch $b$, a contradiction. Nor can $t$ be a
descendant of $s'$, since that would contradict the definition of $k$. Therefore
$X_{A_1}$ is performed at $s'$, which contradicts the assumption that there is no
subtree of the form shown on the left of Figure 4. This concludes the proof. □

From numerical experiments, we know, however, that the switching property
is not satisfied for an arbitrary (convex) power function $\Psi$. Whereas we
believe that it should be possible to prove the switching lemma under some
additional conditions on $\Psi$, we have so far only been able to prove it for one
case we refer to as the "harmonic" cost function,

(26)    $\Psi(x) = 2 - 2\sqrt{1 - x} - x,$

which we now investigate.

6.4. CTF optimality for the harmonic cost function. Throughout this
section $\Psi$ is given by (26). This function has the following properties:

1. $\Psi$ is convex and increasing;

2. $\Psi(0) = \Psi'(0) = 0$ and $\Psi(1) = 1$, $\Psi'(1) = \infty$;

3. $\Psi^*(x) = \frac{x^2}{1 + x}$; $\Phi_a(x) = \frac{ax}{a + x} = (x^{-1} + a^{-1})^{-1}$.

Note that $x$ and $a$ have symmetric roles in $\Phi_a$, and that $\Phi_a(x)$ is the "harmonic
sum" of $x$ and $a$.

We first study the switching lemma in the case of an empty left subtree $T_1$.

Lemma 5. Consider two tests $X_A$ and $X_B$ with $\Gamma(|A|) = a$ and $\Gamma(|B|) = b$.
Let $T_{AB}$ be the tree shown on the right-hand side of Figure 4 with $T_1 = \emptyset$
and let $T_{BA}$ have the same structure with $X_A$ and $X_B$ reversed. Then, with
the optimal assignment of powers to $X_A$ and $X_B$, both $T_{AB}$ and $T_{BA}$ have
the same cost.

Proof. By applying Lemma 3 (with $x = 0$) twice, the cost of $T_{AB}$ is
$\Phi_a \circ \Phi_b(y)$ and the cost of $T_{BA}$ is $\Phi_b \circ \Phi_a(y)$. It is then easy to check that

$\Phi_a \circ \Phi_b(y) = \Phi_b \circ \Phi_a(y) = \frac{aby}{ay + by + ab} = (a^{-1} + b^{-1} + y^{-1})^{-1}.$ □

Note. Clearly, $\Phi_a \circ \Phi_b(x)$ is the harmonic sum of $x$, $a$ and $b$. More
generally, consider any "right vine" $T$ consisting of at most one test per level
of resolution. Then, under $\Psi$ the average cost of $T$ (with optimal powers)
is independent of the order in which the tests are performed; moreover, this
average cost is simply the harmonic mean of the values $\Gamma(|A_j|)$ for the tests
performed. In particular, this result is totally independent of the choice of
the complexity function $\Gamma$.

We now return to the "full" switching lemma: 

Theorem 6. The switching property, and hence the optimality of the
CTF strategy, holds for the harmonic power function with any complexity
function $\Gamma$.

For the proof see the Appendix. 

Analogy with resistor networks. We conclude this section with a curious
connection: Consider a hierarchy of depth $L$ with coarsest attribute $A_1$
and $a_1 = \Gamma(|A_1|)$. Let $C_1$ be the average cost of the CTF strategy for the
hierarchy with $A_1$ removed. From Lemma 3, with $x = 0$ and $y = C_1$,

$E_0[C(T_{\mathrm{ctf}})] = \Phi_{a_1}(C_1) = \frac{a_1 C_1}{a_1 + C_1} = \left(\frac{1}{a_1} + \frac{1}{C_1}\right)^{-1}.$

This is exactly the conductance of an electrical circuit composed of two serial
resistors of conductances $C_1$ and $a_1$. Continuing, $C_1$ is the sum of the CTF
costs over the two subhierarchies of depth $L - 1$; if $C_1'$ denotes the cost of
these hierarchies, the cost $C_1$ can be interpreted as the conductance of an
electrical circuit formed from two parallel resistors, each of conductance $C_1'$.
The global cost of the CTF strategy is therefore equal to the conductance
of the tree-structured resistor network depicted in Figure 5 (wherein a row
of resistors is added at the bottom of the tree in order to represent the cost
of the postprocessing, or, equivalently, perfect testing). We observe that
nothing would be changed in the case of a nonsymmetric, tree-structured
hierarchy, even with attributes of varying complexities at the same level.

Fig. 5. Tree-structured resistor network identified with the attribute hierarchy, where
$a_l = \Gamma(|A_l|)$ is the complexity of attributes of level $l$ and $R_l = 1/a_l$ is the associated
resistance; note that $a_4 = 1$ by convention for the bottom attributes. The last row of resistors
represents the postprocessing stage. The conductance of this circuit is exactly the CTF
testing cost of the attribute hierarchy when $\Psi$ is the harmonic power function.
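The resistor analogy translates directly into a short recursive computation. The sketch below is illustrative only: it represents a hierarchy as nested `(complexity, children)` tuples (a hypothetical encoding, not one used in the paper), combines conductances in series and in parallel, and appends one unit resistor below each leaf attribute for the postprocessing stage, assuming the harmonic power function.

```python
def series(g1, g2):
    # Conductance of two resistors in series, given their conductances.
    return g1 * g2 / (g1 + g2)


def ctf_conductance(node):
    """CTF cost of the hierarchy rooted at `node` = (complexity a, list of children)."""
    a, children = node
    if children:
        # Subhierarchies act as parallel branches: their conductances add.
        below = sum(ctf_conductance(child) for child in children)
    else:
        below = 1.0            # postprocessing resistor (perfect test of cost 1)
    return series(a, below)


def leaf():
    return (1.0, [])


# Regular dyadic hierarchy of depth 3 with Gamma(k) = k (4 patterns).
dyadic = (4.0, [(2.0, [leaf(), leaf()]), (2.0, [leaf(), leaf()])])
# A nonsymmetric hierarchy over 3 patterns.
skewed = (3.0, [(2.0, [leaf(), leaf()]), leaf()])

print(ctf_conductance(dyadic), ctf_conductance(skewed))
```

For the dyadic example the result agrees with the value of $C_3$ obtained from recursion (22) under the same harmonic assumption; the second example shows that the same computation applies unchanged to a nonsymmetric tree.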



6.5. Simulations. In this section we investigate the optimality of CTF
search by way of simulations involving several different power functions $\Psi$.
In every case we take $\Gamma(k) = k$. The various choices of $\Psi$, and corresponding
functions $\Phi_1(x) = x - \Psi^*(x)$, are presented in Table 1; obviously we have
chosen functions with closed-form Legendre transforms. We took $\lambda = 1$ for
$\Psi_4$ and $\mu = 8$ for $\Psi_7$.

Table 1
Convex power functions used in our simulations

Number    $\Psi(x)$    $\Phi_1(x)$

1    $x(1 - \sqrt{1 - x})$    $x - \bigl(1 - \tfrac{1}{9}(1 - x + \sqrt{(1 - x)^2 + 3})^2\bigr)\bigl(\tfrac{2}{3}(x - 1) + \tfrac{1}{3}\sqrt{(1 - x)^2 + 3}\bigr)$

2    $x^2/2$    $x - x^2/2$ if $x \le 1$; $1/2$ otherwise

3    $1 - \sqrt{1 - x^2}$    $1 + x - \sqrt{x^2 + 1}$

4    $e^{\lambda x} - 1$    $x$ if $x \le \lambda$; $x - 1 - \frac{x}{\lambda}(\log(\frac{x}{\lambda}) - 1)$ if $\lambda < x < \lambda e^{\lambda}$; $e^{\lambda} - 1$ if $x \ge \lambda e^{\lambda}$

5    $2 - x - 2\sqrt{1 - x}$    $x/(1 + x)$

6    $1 - \sqrt{1 - x}$    $x$ if $x \le 1/2$; $1 - \frac{1}{4x}$ otherwise

7    $\exp(\mu x) - 1 - \mu x$    $x(1 + \frac{1}{\mu}) - (1 + \frac{x}{\mu})\log(1 + \frac{x}{\mu})$ if $x \le \mu(e^{\mu} - 1)$; $e^{\mu} - 1 - \mu$ otherwise

Note that $\Psi_5$ is the harmonic function.

First we investigated the switching property, which we know to be sufficient
for the optimality of $T_{\mathrm{ctf}}$. To this end, we computed and plotted the
difference $\Delta(a, b, x, y)$ between the left-hand side and the right-hand side of
the key inequality (25). Without loss of generality, we put $a = 1$. Plots of
$\Delta(1, b, x, y)$ are given in [6] for the particular choice $b = 2$. The switching
property is satisfied if the surface lies below the $xy$-plane. Some of these
surfaces (corresponding to $\Psi_2$, $\Psi_4$, $\Psi_6$) clearly do not, whereas the others
appear to satisfy this inequality (at least all sampled values are negative).
In other experiments with other values of $b$ for $\Psi_1$ and $\Psi_3$ we always found
$\Delta \le 0$. However, we found regions with $\Delta > 0$ for $\Psi_7$ for higher values of $b$,
and hence this cost function does not satisfy the switching property.
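The quantity $\Delta(a, b, x, y)$ is simple to evaluate once $\Phi_a$ is available in closed form. As a hedged illustration (not a reproduction of the plots in [6]), the sketch below evaluates it on a coarse grid for the harmonic power function only, for which $\Phi_a(z) = az/(a+z)$; the grid, the function names and the restriction to $a \ge b$ are assumptions of this sketch.

```python
def phi(a, z):
    # Phi_a for the harmonic power function, valid for z >= 0.
    return a * z / (a + z)


def switching_gap(a, b, x, y):
    """Left-hand side minus right-hand side of inequality (25)."""
    lhs = phi(a, x + phi(b, y - x))
    rhs = phi(a, x) + phi(b, phi(a, y) - phi(a, x))
    return lhs - rhs


xs = [0.0, 0.1, 0.5, 1.0, 2.0, 5.0]
increments = [0.0, 0.1, 0.5, 1.0, 3.0, 10.0]       # y = x + increment >= x
worst_gap = max(
    switching_gap(a, b, x, x + d)
    for a in (1.0, 2.0, 4.0, 8.0)
    for b in (0.1, 0.5, 1.0, 2.0, 4.0, 8.0) if b <= a
    for x in xs
    for d in increments
)
print(worst_gap)   # stays at or below zero (up to rounding) on this grid
```

On this grid the gap is never positive, consistent with Theorem 6; replacing `phi` by the $\Phi_a$ of another power function would be the way to probe the other cases in Table 1.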

From these plots it is tempting to speculate that only power functions
such that $\Psi'(0) = 0$ and $\Psi'(1) = +\infty$ can satisfy the full switching property;
however, these conditions are very likely not sufficient. Note that $\Psi'(0) = 0$
means that, at any given level of invariance, one can have an arbitrarily
small cost-to-power ratio and $\Psi'(1) = +\infty$ means that very high powers are
likely not worth the increased cost. Intuitively, both of these properties favor
CTF strategies.

The second type of simulation was more direct. Strategies were sampled at
random by the simplest method possible: we sampled purely attribute-based
strategies $T$ by recursively visiting nodes and choosing an attribute $A \in \mathcal{A}$ at
random subject to the two obvious constraints: (i) no attribute is repeated
along the same branch, and (ii) no "useless" attribute is chosen, that is, one
consisting entirely of patterns already ruled out by the previous tests.
Then, for each such $T$, powers were individually assigned to the tests at each
node in order to minimize the cost, which was compared with that of the
CTF strategy. This procedure was repeated for various choices of $\Psi$ [with
$\Gamma(k) = k$] for regular, dyadic hierarchies for $|\mathcal{Y}| = 4$ patterns (i.e., $L = 3$)
and for $|\mathcal{Y}| = 8$ patterns (i.e., $L = 4$). For each $\Psi$, we sampled several tens
of thousands of trees $T$. [Of course the sheer number of possible strategies
(modulo power assignments) in the case $L = 4$ is several orders of magnitude
larger.] Summarizing our observations:

(a) In all cases, the CTF strategy had lower cost than any other strategy
sampled.

(b) Upon visual inspection, the best sampled strategies seemed close to the
CTF strategy in the sense of only differing at relatively deep nodes.
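The sampling procedure described above can be sketched in a few lines for the smallest case ($|\mathcal{Y}| = 4$, $L = 3$). This is an illustrative reconstruction under stated assumptions, not the authors' code: it uses the harmonic power function (so that $\Phi_a$ has the closed form $az/(a+z)$), it keeps testing until no useful attribute remains, it charges one perfect test of cost 1 per surviving pattern at the end, and it optimizes powers bottom-up through Lemma 3. All names, the seed and the sample size are hypothetical.

```python
import random


def phi(a, z):
    # Phi_a for the harmonic power function; Phi_a(z) = z when z <= 0
    # (the optimal power is then 0, i.e., the test is effectively skipped).
    return a * z / (a + z) if z > 0 else z


# Dyadic hierarchy over 4 patterns (L = 3); each attribute is a set of patterns.
ATTRIBUTES = [frozenset(s) for s in
              ([0, 1, 2, 3], [0, 1], [2, 3], [0], [1], [2], [3])]
ROOT = ATTRIBUTES[0]


def sampled_cost(alive, tested, rng):
    """Cost of one randomly drawn attribute-based strategy, powers optimized bottom-up."""
    useful = [A for A in ATTRIBUTES if A not in tested and A & alive]
    if not useful:
        return float(len(alive))                     # one perfect test per survivor
    A = rng.choice(useful)
    x = sampled_cost(alive - A, tested | {A}, rng)   # branch X_A = 0
    y = sampled_cost(alive, tested | {A}, rng)       # branch X_A = 1
    return x + phi(float(len(A)), y - x)             # Lemma 3 with Gamma(k) = k


def ctf_cost(A):
    children = [B for B in ATTRIBUTES if B < A and 2 * len(B) == len(A)]
    below = sum(ctf_cost(B) for B in children) if children else 1.0
    return phi(float(len(A)), below)


rng = random.Random(0)
best_sampled = min(sampled_cost(ROOT, frozenset(), rng) for _ in range(2000))
ctf = ctf_cost(ROOT)
print(ctf, best_sampled)
```

Under these assumptions no sampled strategy should beat the CTF cost, in agreement with observation (a); swapping in another $\Phi_a$ from Table 1 would be the natural way to probe other power functions.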

In conclusion, and bearing in mind the limited scope of both types of
simulations, we believe the following conclusions are reasonable:

1. The switching property is quite likely valid for cost models other than the
harmonic function; however, it requires hypotheses in addition to convexity.

2. The optimality of the CTF strategy probably holds for a very wide range
of cost models, including those which do not satisfy the switching property
(for all values of $a, b, x, y$). As a result, requiring the switching property
is likely too restrictive and, more generally, arguments based on the (CF)
property may not be the most efficacious.

7. Remarks on a usage-based cost model. In this section we summarize
some results obtained in [6] for a somewhat different scenario. We consider
only the case of a fixed-powers hierarchy. In this model, the cost of a
test $c(X)$ may be chosen in accordance with the strategy employed; it depends
on the "resource" $r(X)$ allocated to it [through a negative exponential
function $r(X) = \exp(-c(X))$] and there is a global resource constraint,
$\sum_{X \in \mathcal{X}} r(X) \le R \le 1$. This corresponds to the belief that in some circumstances
it might not be efficient to fix the costs of the tests in advance,
regardless of their inherent complexity. It may be more efficient to allow the
utilization of computing resources to be partitioned in accordance with the
frequency with which certain routines are performed; in this case the cost
represents the computing time rather than the computing complexity. In
this framework the optimal resource allocation gives rise to a usage-based
cost; the cheapest tests are the ones used the most often in a given strategy.
The testing cost of a strategy with optimal resource allocation is then (from
standard arguments)

(27)    $E_0[C_{\mathrm{test}}(T)] = -\sum_X q_X(T) \log(q_X(T)) + Q(T) \log(Q(T)/R),$

where $Q(T) = \sum_X q_X(T)$. Furthermore, no postprocessing cost is taken into
account, but we only allow complete strategies, so that the goal is to minimize
(27) over complete strategies. In [6] we prove that for a hierarchy for
which each attribute has at least two children and for which the powers are
increasing with the resolution level, the CTF strategy is optimal if we assume
that all tests have power greater than some constant $\beta_1 = 7/8$. While
this (probably improvable) value is not entirely realistic as far as practical
applications are concerned, we believe it is an important step in favor of
CTF optimality for this cost model.

In addition, we argue that it makes sense in this latter framework to
consider an extended scenario where repeated search tasks are undertaken
for different sets of target patterns, whereas the resources are distributed
in advance among all tests. While the set of targets changes from task to
task, the individual attribute tests are reusable. The patterns are identified
with conjunctions of abstract attributes at different resolution levels, taken
from a possibly very large pool. Whereas the analysis in the fixed-cost model
remains unchanged, there is a significant difference under usage-based cost
since we must distribute the resources over a larger number of tests. In order
to simplify the analysis, we suppose the set of target patterns $\mathcal{Y}$ is randomized
for each new search task and again present some fairly mild sufficient
conditions (about the dependence of power on resolution and the size of the
attribute pools, e.g., exponential growth of pool size with resolution, and
negative polynomial decrease of type II error) ensuring the optimality of
CTF strategies.

8. Applications to pattern recognition. In order to illustrate our frame- 
work for pattern recognition we present two types of results: First, we give 
a few examples of the scene interpretation problem and cite some previous 
work on a CTF strategy for object detection. Only pictures and references 
are provided. The purpose is merely to demonstrate the efficacy of the ap- 
proach in a real computational vision problem. Second, in order to illustrate 
numerically the quantities appearing in our analysis, and to check whether 
the cost model is reasonable in at least one concrete setting, we outline 
a more or less exact implementation, due to Franck Jung, of the pattern 
filtering design for a synthetic example introduced in [20] — detecting rect- 
angles amidst clutter. It was developed in order to automate cartography by 
detecting roofs of buildings in aerial photographs [21]. Only those aspects 
which shed light on the mathematical analysis are described; all the details 
may be found in [6]. 





Fig. 6. Left: a "natural" image. Right: group photograph used in an experiment on face
detection.

8.1. Scene interpretation. Consider the scenes in Figure 6. The semantic
interpretation of the left image (town, shops, pedestrians, etc.) is effortless 
for humans but far beyond what any artificial system can do. For the image 
on the right, the goal might be more modest — detect and localize the faces. 
Enriching the description with information about the precise pose (scale, 
orientation, etc.), identities or expressions would be more ambitious. Many 
methods have been proposed for face detection, including artificial neural 
networks [24], Gaussian models [26], support vector machines [22], Bayesian 
inference [9] and deformable templates [30]. 

To relate these tasks to the framework of this paper, imagine attempting
to characterize a (randomly selected) subimage containing at most one object
from a predetermined repertoire. (The whole scene can then be searched
by a divide-and-conquer strategy; see Section 8.2 and [12].) The dominating
explanation $Y = 0$ corresponds to "background" or "clutter" and each of the
others, $Y \in \mathcal{Y}$, corresponds to the instantiation of an object wholly visible
in the subimage. Even with only one (generic) object class, the number of
possible instantiations is very large; that is, there is still considerable
within-class variability. For instance, detecting a face at a fixed position, scale and
orientation might not be terribly difficult, even given variations in lighting
and nonlinear variations due to expressions; it can be accomplished with
standard learning algorithms such as multilayer perceptrons, decision trees
and support vector machines. However, the amount of computation required
to do this separately for every possible pose is prohibitive. Instead, we propose
to search simultaneously for many instantiations, say over a range of
locations, scales and orientations. In our simplified mathematical analysis,
that range of poses is $A = \mathcal{Y}$, which is the "scope" of our coarsest test $X_A$.
(It may not be practical to envision a totally invariant test, in which case
there are multiple hierarchies.)

This approach to scene interpretation has been shown to be highly ef- 
fective in practice. A version involving successive partitions of object/pose 
pairings, rank-based tests for the corresponding (classes of) hypotheses and 




Fig. 7. The detections (left) and ''density of work" fright ) for the group photo. 
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breadth-first CTF search appears in [19]. The detection results shown in 
Figure 7 were obtained by an algorithm [14] based on the strategy proposed 
here, namely traversing a multiresolution hierarchy of binary hypothesis tests
$\mathcal{X} = \{X_A, A \in \mathcal{A}\}$, where each $A$ represents a family of shapes with some
common properties and $X_A$ is an image functional designed to detect shapes in
this family. In the face detection experiments, $A$ is a subset of affine poses
and $X_A$ is based on checking for special local features (e.g., edges) which
are likely to be present for faces with poses in $A$. In fact, $X_A$ can be
interpreted as a likelihood ratio test [3]. Recently, researchers in the computer
vision community have started using similar methods for similar problems; 
see, for example, [25] and [29]. Ideas related to CTF processing have also 
been proposed by [15] in a Bayesian classification framework where a hier- 
archy of estimators is built for the posterior of recursively clustered classes. 
In Figure 7, the efficiency of sequential testing is illustrated for the group 
photo by counting, for each pixel, the amount of computation performed 
in its vicinity; clearly the spatial "density of work" is highly skewed. The 
corresponding density would be flat for nearly all other methods, that is, 
those based on multilayer perceptrons or support vector machines. 

8.2. Rectangle detection. The goal is to find and localize rectangles in 
a "scene" of the type shown in Figure 11. The generative model (which 
involves first inserting and degrading rectangles and then adding clutter) is 
described in [6]. 

There are many ways to find the rectangles. For instance, one could use 
any of the methods cited above for finding faces. For the artificial problem 
illustrated in Figure 11, with limited noise and clutter, it would not be 
surprising to obtain a decent solution with standard model-based or learning- 
based methods. Our intention is only to demonstrate how this might be done 
in an especially efficient manner with a sequential testing design. 

8.2.1. Problem formulation. It is clearly impossible to find common but 
localized attributes of two rectangles with significantly different (geometric) 
poses, say far apart in the scene. Here, the "pose" of a rectangle has four pa- 
rameters: orientation, center, height and length. Consequently, we divide the 
whole scene into nonoverlapping $5 \times 5$ regions and apply a simple
"divide-and-conquer" strategy based on location. Each $5 \times 5$ region $R$ is visited in
order to determine if there is a rectangle in the scene whose distinguished 
point (say the center) lies in R; depending on its scale, the rectangle itself 
will enclose some portion of the scene surrounding R. We can assume that 
the scale of the rectangle is restricted to a given range whose lower end 
represents the smallest rectangles we attempt to find. Larger rectangles are 
found by repeatedly downsampling the image and parsing the scene in the 
same way; this is how the faces in Figure 6 were detected. (Similarly, the
orientation of the rectangle is restricted to a given range of angles; other
orientations could be found by repeating the process with suitably transformed
detectors.) 

Partitioning the scene into nonoverlapping regions and downsampling to 
handle scale can be thought of as the first two levels of a recursive parti- 
tioning of the full pose space. The loops over regions R and scales are the 
"parallel component" of the algorithm and not of interest here. The serial 
component is a CTF search to determine if there is a rectangle within a 
range of scales whose center lies in a fixed region R. This is the heart of the 
algorithm and the real source of efficient computation. The hypothesis $Y = 0$
stands for "no rectangle with these parameters" and is evidently a complex
mixture of configurations due to clutter, larger rectangles and nearby ones.
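The serial component described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the authors' implementation: the `Cell` class, the `make_cell` helper, the unit test costs and the toy hierarchy over four poses are all hypothetical, and the toy tests are error-free, whereas the tests in the paper only guarantee no false negatives.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Set, Tuple


@dataclass
class Cell:
    """An attribute A: a set of fine poses, its test X_A, and its child cells."""
    poses: FrozenSet[int]
    test: Callable[[object], bool]  # hypothetical X_A; True means "A not ruled out"
    cost: float = 1.0               # hypothetical cost of evaluating X_A
    children: List["Cell"] = field(default_factory=list)


def ctf_search(cell: Cell, data: object) -> Tuple[Set[int], float]:
    """Depth-first coarse-to-fine search starting at `cell`.

    A cell is tested only if all of its ancestors answered positively.
    Returns the poses that were not ruled out and the total cost of the
    tests actually performed.
    """
    if not cell.test(data):
        return set(), cell.cost      # the whole subtree is discarded after one test
    if not cell.children:
        return set(cell.poses), cell.cost
    detected: Set[int] = set()
    total = cell.cost
    for child in cell.children:
        found, spent = ctf_search(child, data)
        detected |= found
        total += spent
    return detected, total


def make_cell(poses, children=()):
    """Toy cell with an error-free test: positive iff the true pose lies in it."""
    pose_set = frozenset(poses)
    return Cell(pose_set, lambda true_pose: true_pose in pose_set, 1.0, list(children))


# Toy two-level binary hierarchy over four fine poses (an assumption for illustration).
leaves = [make_cell({k}) for k in range(4)]
root = make_cell(range(4), [make_cell({0, 1}, leaves[:2]), make_cell({2, 3}, leaves[2:])])

background = ctf_search(root, None)  # no object: only the root test is run
with_object = ctf_search(root, 2)    # true pose 2: five tests are run
```

Under the background hypothesis the search stops after a single coarse test, which is the source of the skewed "density of work" discussed in Section 8.1; the outer loops over regions $R$ and scales would simply call `ctf_search` once per region.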

8.2.2. Patterns, attributes and tests. In order to define the set of
explanations $\mathcal{Y}$, we partition the (reference) pose space into small subsets. A
"pattern" or "explanation" $y \in \mathcal{Y}$ is then a subset of poses at approximately
the resolution of the pixel lattice. In fact, these subsets are, by definition,
the cells at the finest layer of the attribute hierarchy, a recursive
partitioning of $\mathcal{Y}$ of the type used throughout the paper, yielding
$\mathcal{A} = \{A_{l,k}\}$. In this case $Y$ represents the true pose at the pixel resolution.

There are $L = 6$ levels which correspond to five splits: two (binary) on
orientation, one (quaternary) on position and two (binary) on scale (one
on height and one on length). In particular there are $|\mathcal{Y}| = 64$ finest cells,
each with resolution 1.25 pixels in location, two pixels in length and height,
and $\pi/16$ radians in tilt. Let $\eta_l$ be the cardinality (scope) of attributes at
resolution level $l$. The quaternary split happens to be the second one, and
hence

    $(\eta_1, \ldots, \eta_6) = (64, 32, 8, 4, 2, 1)$.

As in the references cited above, the tests $X_A$ are extremely simple
image functionals based on local features $\xi$ related to edges. Each test $X_A$
is based on a threshold $\tau = \tau(A)$ and a collection $S(A)$ of these features
(corresponding to varying positions, orientations and levels of resolution):

    $X_A = 1$ if $\sum_{\xi \in S(A)} \xi \geq \tau$, and $X_A = 0$ otherwise.

Thus, evaluating $X_A$ consists of checking for at least $\tau$ features among a
special ensemble dedicated to $A$. Actually, we build many tests of varying
powers for each $A \in \mathcal{A}$, each one corresponding to a different collection $S$.
Identifying $S$ and $\tau$ is a problem in statistical learning. We use a fairly
simple procedure which is described in [6].

The cost $c(X_A)$ is defined as the number of pixels involved in evaluating
$X_A$, which is the number of pixels which participate in the definition of any
$\xi \in S(X_A)$. Assuming no preprocessing other than extracting and storing all
the edges in the scene (and no other shortcuts in evaluating a test), this is
roughly proportional to the actual algorithmic cost in CPU terms.

Fig. 8. Cost vs. power curve for attributes of depth one and two.
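A test of this kind, together with its pixel-count cost, might look as follows. This is a minimal sketch under stated assumptions: the `EdgeFeature` class (a feature that is present when all of its pixels are edge pixels) is hypothetical, since the actual features and the learning of $S$ and $\tau$ are described in [6] and not reproduced here.

```python
from typing import Iterable, List, Set, Tuple

Pixel = Tuple[int, int]


class EdgeFeature:
    """Hypothetical binary local feature: 1 iff all of its pixels are edge pixels."""

    def __init__(self, pixels: Iterable[Pixel]):
        self.pixels = frozenset(pixels)

    def __call__(self, edges: Set[Pixel]) -> int:
        return int(self.pixels <= edges)


def evaluate_test(features: List[EdgeFeature], tau: int, edges: Set[Pixel]) -> int:
    """X_A = 1 iff at least tau of the features in the collection are present."""
    return int(sum(f(edges) for f in features) >= tau)


def pixel_cost(features: List[EdgeFeature]) -> int:
    """Cost of the test: number of distinct pixels used by any of its features."""
    used: Set[Pixel] = set()
    for f in features:
        used |= f.pixels
    return len(used)


# Toy collection S with three features and a toy edge map (both made up).
S = [EdgeFeature([(0, 0), (0, 1)]), EdgeFeature([(0, 1), (1, 1)]), EdgeFeature([(5, 5)])]
edge_map = {(0, 0), (0, 1), (1, 1)}
```

Pixels shared by several features are counted once, which matches the definition of the cost as the number of pixels participating in any feature of the collection.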

Recall our basic constraint: $P(X_A = 1 \mid Y \in A) = 1$ for every test $X_A$.
In particular, we demand that $X_A = 1$ when the image data surrounding
$R$ contains a rectangle whose pose belongs to $A$. Of course the test
may also respond positively in the absence of such a rectangle, due to
clutter and nearby rectangles; the likelihood of this happening is precisely
$1 - \beta(A) = P(X_A = 1 \mid Y = 0)$. Intuitively, we expect that high power will only
be possible at low invariance (specific poses). The power $\beta(A)$ is estimated
from large samples of randomly selected background subimages.
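The estimate of the power is a plain empirical frequency. A minimal sketch, in which the function name and the toy test are assumptions made for illustration:

```python
from typing import Callable, Iterable


def estimate_power(test: Callable[[object], int], background_samples: Iterable[object]) -> float:
    """Empirical estimate of beta(A) = P(X_A = 0 | Y = 0).

    `test` returns 1 (positive) or 0 (negative) on a background subimage;
    the estimate is the fraction of background samples answered negatively.
    """
    samples = list(background_samples)
    if not samples:
        raise ValueError("at least one background sample is required")
    negatives = sum(1 for s in samples if test(s) == 0)
    return negatives / len(samples)


# Toy usage: a "subimage" is summarized by its feature count, and the
# hypothetical test fires when the count reaches the threshold 3.
toy_test = lambda count: int(count >= 3)
power = estimate_power(toy_test, [0, 1, 2, 3, 4])
```

The complementary frequency, one minus this estimate, is the empirical false positive rate $P(X_A = 1 \mid Y = 0)$ on background.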

In Figure 8 we plot cost versus power for the family of all tests generated
for the root cell, $A_{1,1}$, referred to as "cell 1," and one of its two daughter
cells, referred to as "cell 2." Thus each point is a pair $(\beta, c(X_{A,\beta}))$. For the
root cell we cannot make tests with arbitrarily large power, at least not with
such simple functionals. The "best tests" are those which are not strictly
dominated by another test with respect to both cost and power, basically
the convex envelope of the whole family; plots are given in [6]. Plots for
cells at other depths are very similar, and the convexity assumption made
in Sections 5 and 6 seems to be roughly satisfied.
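Selecting the tests that are not strictly dominated is a small Pareto-front computation. The sketch below implements only the non-domination filter, which is a slightly larger set than the convex envelope mentioned above; the function name and the numbers are hypothetical.

```python
from typing import List, Tuple


def best_tests(tests: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Keep the (power, cost) pairs not strictly dominated by another pair.

    A pair is strictly dominated when some other pair has both strictly
    higher power and strictly lower cost.
    """
    kept = []
    for power, cost in tests:
        dominated = any(p > power and c < cost for p, c in tests)
        if not dominated:
            kept.append((power, cost))
    return sorted(kept)


# Hypothetical family of tests for one cell, as (power, cost) pairs.
family = [(0.2, 1.0), (0.5, 3.0), (0.4, 4.0), (0.8, 10.0), (0.7, 12.0)]
front = best_tests(family)
```

Here `(0.4, 4.0)` is discarded because `(0.5, 3.0)` is both more powerful and cheaper, and `(0.7, 12.0)` is discarded for the same reason relative to `(0.8, 10.0)`.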

Finally, one can ask whether the functional form of our global cost model,
namely $c(X_{A,\beta}) = \Gamma(|A|) \times \Psi(\beta)$, is consistent with the data. This means an
additive model for the log of the cost. In Figure 9 we plot the (base 2)
logarithm of cost against the (base 2) logarithm of $\eta_l$ for five selected powers.
Each point is one test: the one with lowest cost among those with power
very close to a selected value. The fact that the curves are roughly
translations of each other is consistent with the additive model for the log-cost. The
roughly linear dependence of the log-cost with respect to $\log(|A|)$ suggests
a power dependence as a first approximation ($\Gamma(x) \propto x^{\alpha}$ for some $\alpha \in [0, 1]$).

8.2.3. Detection results. We use the framework of Section 5, that is,
power-based cost for a fixed hierarchy. More specifically, from all the "best tests"
created, we extracted one for each cell $A \in \mathcal{A}$ such that all the powers and
costs are (approximately) the same at each level, which yields one sequence
$(\beta_l, c_l)$, $l = 1, \ldots, 6$, which is increasing in both components and plotted in
Figure 10 (left). Since the powers are increasing, the conditions of
Corollary 2 are satisfied under the cost model. However, we need not assume that
the cost model is valid; we can directly check whether $(\beta_l, c_l)$ satisfies the
hypotheses of Corollary 1. In Figure 10 (right) we show, level by level, the
(logarithms of the) values representing the two sides of (16). The
conditions of Corollary 1 are clearly satisfied.
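The level-by-level check can be reproduced numerically. The sketch below assumes, from the caption of Figure 10 only, that the two sides being compared are $c_l/\beta_l$ and $C_l \, c_{l+1}/\beta_{l+1}$, with the first required to stay below the second; the exact statement of (16) should be taken from Section 5. The powers and costs used here are made-up placeholders, not the experimental values.

```python
import math
from typing import List


def children_per_level(scopes: List[int]) -> List[int]:
    """Number of children C_l of a node at level l, from the scopes eta_l."""
    return [coarse // fine for coarse, fine in zip(scopes, scopes[1:])]


def log_margins(powers: List[float], costs: List[float], scopes: List[int]) -> List[float]:
    """For each level l, log(C_l * c_{l+1} / beta_{l+1}) - log(c_l / beta_l).

    A positive margin means the bottom curve lies below the top curve at l.
    """
    margins = []
    for l, n_children in enumerate(children_per_level(scopes)):
        lower = math.log(costs[l] / powers[l])
        upper = math.log(n_children * costs[l + 1] / powers[l + 1])
        margins.append(upper - lower)
    return margins


scopes = [64, 32, 8, 4, 2, 1]               # (eta_1, ..., eta_6) from Section 8.2.2
powers = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]      # hypothetical increasing powers
costs = [10.0, 12.0, 15.0, 20.0, 30.0, 50.0]  # hypothetical increasing costs
margins = log_margins(powers, costs, scopes)
```

With these placeholder values every margin is positive, which is the numerical counterpart of the top curve lying above the bottom curve in Figure 10 (right).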

The detection results for one scene are shown in Figure 11. In order to
estimate total computation, we processed an $858 \times 626$ scene 100 times. The
average time is 3.25 s on a Pentium 1.5 GHz. For comparison, we can perform
an ideal hypothesis test for each fine cell ($Y \in A_{6,k}$, $k = 1, \ldots, 64$) based
on simply counting all the edges in the region generated by the union of
silhouettes over the poses in $A_{6,k}$ (a form of template-matching) and setting
a threshold to obtain no false negatives. (This is a more discriminating test
than $X_A$ for a fine cell $A$ because the latter uses only some of the edges.) The
average processing time for this brute force approach is far larger (2338 s) but
the results are virtually perfect. Finally, we can perform a two-stage analysis,
first executing the CTF search and then doing the template-matching only
at the detected poses. The processing time is virtually the same as for the
CTF search (about 3 s) but most of the false positives are removed; see
Figure 11.

Fig. 9. Log-cost vs. log-invariance for various powers.

9. Discussion and conclusion. There are many problems in machine learn- 
ing and perception which come down to differentiating among an enormous 
number of competing explanations, some very similar to each other and far 
too many to examine one-by-one. In these cases, efficient representations 
may be as important as statistical learning [18], and thinking about com- 
putation at the start of the day may be essential. It then seems prudent to 
model the computational process itself and hierarchical designs are a nat- 
ural way to do this. Moreover, there is plenty of evidence that this works 
in practice. On the mathematical side, the questions that naturally arise
from thinking about CTF representations and CTF search are of interest
in themselves. We have provided one possible formulation; others could be
envisioned.

Fig. 10. Left: The pairs $(\beta_l, c_l)$ for the fixed hierarchy used in the experiments. Right:
top curve: $l \mapsto \log(C_l \times (c_{l+1}/\beta_{l+1}))$ where $C_l$ is the number of children of a node at level
$l$; bottom curve: $l \mapsto \log(c_l/\beta_l)$. The conditions of Corollary 1 are clearly satisfied.

Fig. 11. Example of a detection result; the small crosses indicate the detected locations.
Left: CTF detection only. Notice there are scattered false positives. Right: CTF search
followed by template-matching. Nearly all the false positives are removed with virtually no
increase in computation.

9.1. Decision trees. Of course "twenty questions," and the search strate- 
gies T studied here based on a fixed family X of binary tests, invoke decision 
(or classification) trees — adaptive procedures for discriminating amongst hy- 
potheses based on sequential testing. 

Most of the literature is about an inductive framework. Trees are induced 
from a training set of i.i.d. samples from the joint distribution of the feature 
vector and class label, and binary tests result from comparing one compo- 
nent of the feature vector to a threshold. A tree is built in a top-down, 
greedy, recursive fashion based on some splitting criterion, usually entropy 
reduction [7]. The construction is then data-driven and locally optimized, 
guided by uncertainty reduction. There is a large literature on application 
of decision trees to pattern recognition which is outside the scope of this 
paper; see [1]. 

Generally, efficient (online) execution is not a criterion for construction or 
performance; for instance, the CART algorithm does not account for mean 
path length, let alone "costs" for the tests. Not surprisingly, recursive greedy 
designs are often globally inefficient, for instance in terms of the mean depth 
necessary to reach a given classification rate. A rarely studied alternative 
is to begin with an explicit statistical model for features and labels and 
compute a tree according to a global criterion involving both accuracy and 
(online) computation. The construction is then model-driven and globally 
optimized. Our approach to calculating $\hat{Y}$ is of this general nature.

We refer the reader to [6] for an expanded discussion of these issues, in- 
cluding some early work due to Garey [16] on optimal testing procedures; 
related strategies for image retrieval [28]; comparisons between depth-first 
CTF and vanilla CART [20], showing that, in general, the latter is not 
CTF; and a special (if unrealistic) case, traced back to [10] and at the in- 
tersection of sequential statistics [8], game theory [5] and adaptive control 
processes [4], in which globally optimal testing strategies can be computed 
using dynamic programming, at least for "small problems." (See also work on
cost-minimizing sequential procedures for Markov decision processes in [23].)
In this special case, some comparisons in accuracy (resp. mean depth) be- 
tween local and global strategies are given in [17] at a fixed mean depth 
(resp. accuracy), revealing an enormous difference in favor of global strate- 
gies, especially with skewed priors, that is, when a priori some classes are 
much more likely than others, which is precisely the situation in pattern
recognition.

9.2. Open issues. Within our formulation there are some unanswered
but fundamental mathematical questions and a few dubious assumptions.
To begin with, we have divided the whole classification problem into two 
distinct and successive phases, first noncontextual (testing against nonspe- 
cific alternatives) and second contextual (testing one subset of explanations 
against another). We have shown that CTF search is effective, even optimal, 
in the first phase and preliminary results (not reported here) indicate the 
same is true of the second phase. However, although sensible, this division
was artificially imposed; in particular, we have not shown that it emerges
naturally from a global formulation of the problem. One might, for example,
expand the family $\mathcal{X} = \{X_A, A \in \mathcal{A}\}$ into a much larger family of hypothesis
tests for testing $Y \in A$ versus $Y \in B$ for various subsets $A$, $B$ and levels of
error, and then attempt to prove that it is in fact computationally efficient
to start with $B = A^c$ under some distributional assumptions, and reasonable
trade-offs among scope, error and cost.

Whereas our results on fixed-power hierarchies are fairly comprehensive,
the results on variable-power hierarchies are evidently not. What is special,
if anything, about the "harmonic cost function"? Simulations suggest that 
the CTF is generically optimal but we have not been able to prove this in 
general. 

On the other hand, several of our model assumptions can be considered 
as too simplistic. Perhaps the cost model should be revisited; in simulations 
high power is not always attainable at high invariance (regardless of cost), at 
least for relatively simple tests (recall Figure 8). As pointed out earlier,
supposing conditional independence under $P_0$ is disputable. Ideally, one should
examine nontrivial dependency structures for $\mathcal{X}$, one appealing model being
a first-order Markov structure of the tests as already depicted in the
simulations of Section 5.4. Also, measuring computation under $P_0$ only is suspect.
At some point in the computational process, as evidence accumulates from 
positive test results for the presence of a pattern of interest, the background 
hypothesis ceases to be dominant and all the class-conditional distributions 
must enter the story. 

More ambitiously, an even more general optimization problem could be 
considered: Design the entire system including the subsets to be tested (not 
requiring a hierarchical structure a priori) as well as the levels of discrimina- 
tion. This would likely involve a dependency structure for overlapping tests. 
Some of these questions are currently being investigated. 

APPENDIX 

Proof of Theorem 4. Consider a given tree-structured hierarchy $\mathcal{A}$.
In this proof, we are mainly interested in the graph structure of $\mathcal{A}$. Here
again it will be easier to consider the equivalent "augmented" model (see
Section 4.5.4), thereby assuming the original $\mathcal{A}$ has been extended one level
by adding a single child to each original leaf (in order to accommodate the
perfect tests which are performed at the end of the search for all
$y \in \hat{Y}$). Except for the power-one constraint for the final singleton tests, for
any node $s$ in a strategy tree, the assigned power $\beta(s)$ may be freely chosen
independently of how it is chosen when the corresponding attribute $A(s)$
appears at other nodes. Of course there must be no errors under $P_0$, but
this is automatically satisfied by definition for any CTF strategy.

To prove the theorem we will proceed by recurrence over the size of
subhierarchies of $\mathcal{A}$. We actually need slightly more general objects than
conventional subhierarchies (i.e., subtrees). We will call $\mathcal{H}$ a generalized
subhierarchy if $\mathcal{H}$ is a finite union of subhierarchies of $\mathcal{A}$. The cardinality of $\mathcal{H}$
is defined as the number of its nodes (internal or leaves). A CTF strategy for
$\mathcal{H}$ satisfies the usual hypothesis that an attribute is tested if and only if all of
its ancestors in $\mathcal{H}$ have been tested and returned a positive answer. Finally,
for a node $B$ of $\mathcal{A}$, denote by $\mathcal{H}_B$ the generalized subhierarchy composed of
all strict descendants of $B$, in other words the union of all the subhierarchies
rooted in direct children of $B$.

Now we prove, by recurrence on the size $c$ of generalized subhierarchies,
the following property:

(P(c)) For any generalized subhierarchy $\mathcal{H}$ of $\mathcal{A}$ of cardinality at most $c$,
every CTF strategy with optimal choice of powers has the same cost
$C_{\mathrm{ctf}}(\mathcal{H})$. Furthermore, for any node $B \in \mathcal{H}$, the test $X_B$ is always
performed in such a CTF strategy with the same power $\beta_B$, and this
value depends only on $\mathcal{H}_B$, being therefore independent of the CTF
strategy considered. Finally, if $\mathcal{H}$ is the union of several disjoint
subhierarchies of $\mathcal{A}$, then the CTF cost of $\mathcal{H}$ is the sum of the CTF costs
of these subhierarchies.

For $c = 1$, any generalized subhierarchy $\mathcal{H}$ must be a single node
(attribute) corresponding to a perfect test, in which case the property is trivial.

Suppose (P(c)) is true and consider a generalized subhierarchy $\mathcal{H}$ of
cardinality $c + 1$. Let $T$ be a CTF strategy for $\mathcal{H}$ with optimally chosen
powers and let $B$ be the attribute which is tested at the root of $T$;
necessarily $B$ has no ancestors in $\mathcal{H}$. Write $\bar{\mathcal{H}}_B$ for the generalized subhierarchy
$\mathcal{H} \setminus (\{B\} \cup \mathcal{H}_B)$.

If $B$ is a leaf, then, by construction, its power is fixed to 1 and $\mathcal{H}_B = \emptyset$.
Hence, after $B$ is tested with power 1 (thus returning a null answer under
$P_0$), the remaining part of $T$ is a CTF strategy for the subhierarchy $\bar{\mathcal{H}}_B$, and
therefore, by the hypothesis of recurrence,

(28)    $E[C(T)] = \Psi(1) + C_{\mathrm{ctf}}(\bar{\mathcal{H}}_B)$.

Suppose now that $B$ is not a leaf. If the test $X_B = 0$, the subsequent
part of strategy $T$ must be a CTF exploration, with optimal powers, of the
subhierarchy $\bar{\mathcal{H}}_B$. Similarly, if $X_B = 1$, the subsequent part of $T$ is a CTF
strategy for $\mathcal{H}_B \cup \bar{\mathcal{H}}_B$, a disjoint union. By Lemma 3 and the recurrence
hypothesis concerning cost additivity over disjoint subhierarchies, we therefore
have

(29)    $E[C(T)] = C_{\mathrm{ctf}}(\bar{\mathcal{H}}_B) + \Phi_{\Gamma(|B|)}\big(C_{\mathrm{ctf}}(\mathcal{H}_B \cup \bar{\mathcal{H}}_B) - C_{\mathrm{ctf}}(\bar{\mathcal{H}}_B)\big)$
                 $= C_{\mathrm{ctf}}(\bar{\mathcal{H}}_B) + \Phi_{\Gamma(|B|)}\big(C_{\mathrm{ctf}}(\mathcal{H}_B)\big)$.

Furthermore, the second part of Lemma 3 shows that the optimal power
chosen for $X_B$ only depends on $C_{\mathrm{ctf}}(\mathcal{H}_B)$.

Property (P(c + 1)) is now an immediate consequence of (28) and (29),
which concludes the proof. □
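Read together with cost additivity, (28) and (29) suggest a bottom-up recursion for the CTF cost of a single tree: a leaf costs $\Psi(1)$, and a node $B$ with descendants costs $\Phi_{\Gamma(|B|)}$ applied to the summed costs of its child subtrees. The sketch below illustrates that recursion only. It assumes the harmonic form $\Phi_a(x) = ax/(a + x)$ (consistent with the derivative used in the proof of Theorem 6 if $\Phi_a(0) = 0$), a hypothetical $\Gamma(n) = n^{\alpha}$, and a leaf cost passed as a parameter; none of these choices is fixed by this section.

```python
from typing import List, Tuple

# A node is (scope |B|, list of child nodes); leaves have an empty child list.
Node = Tuple[int, List["Node"]]


def phi(a: float, x: float) -> float:
    """Assumed harmonic form Phi_a(x) = a * x / (a + x)."""
    return a * x / (a + x) if a + x > 0 else 0.0


def gamma(scope: int, alpha: float = 0.5) -> float:
    """Hypothetical scope dependence Gamma(n) = n ** alpha."""
    return float(scope) ** alpha


def ctf_cost(node: Node, leaf_cost: float = 1.0) -> float:
    """CTF cost of the subtree rooted at `node`, following (28) and (29)."""
    scope, children = node
    if not children:
        return leaf_cost                     # perfect test at a leaf, cost Psi(1)
    below = sum(ctf_cost(child, leaf_cost) for child in children)  # additivity
    return phi(gamma(scope), below)


# Toy tree: a root of scope 2 with two singleton leaves.
toy_tree: Node = (2, [(1, []), (1, [])])
toy_cost = ctf_cost(toy_tree)
```

Since $\Phi_a(x) \leq x$ under the assumed form, testing the root first can never cost more than testing both leaves directly, which the toy value reflects.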



Proof of Theorem 6. Our goal is to prove the switching property
(25) for the harmonic cost, that is,

    $\forall y \geq x \geq 0, \ \forall a \geq b: \quad \Phi_a(x) + \Phi_b(\Phi_a(y) - \Phi_a(x)) \geq \Phi_a(x + \Phi_b(y - x)).$

This is obviously satisfied when $x = y$ (for any choice of $a$ and $b$). Denote
by $C_L(y; x)$ [resp. $C_R(y; x)$] the left-hand (resp. right-hand) side of the above
inequality. We will show that

(30)    $\frac{\partial C_L(y; x)}{\partial y} \geq \frac{\partial C_R(y; x)}{\partial y}$ for all $y \geq x \geq 0$,

which will conclude the proof. Taking derivatives in (30) we obtain

(31)    $\Phi_b'(\Phi_a(y) - \Phi_a(x)) \, \Phi_a'(y) \geq \Phi_a'(x + \Phi_b(y - x)) \, \Phi_b'(y - x)$

with

    $\Phi_b'(x) = \left( \frac{b}{x + b} \right)^2.$

After some elementary algebra, we find that (31) is equivalent to

    $(y - x)[(x + a)^2 - a^2] \geq 0,$

which is true since $y \geq x$. This concludes the proof. □
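The "elementary algebra" can be spelled out as follows. This is a sketch under two assumptions that this section does not state explicitly: $a, b > 0$, and $\Phi_a(x) = ax/(a + x)$, which is the antiderivative of the expression above for the derivative that vanishes at $x = 0$.

```latex
\begin{align*}
u &= y - x, \qquad
D = \Phi_a(y) - \Phi_a(x) = \frac{a^2 u}{(a + x)(a + y)}, \\
\text{(31)} &\iff
\frac{a^2 b^2}{(D + b)^2 (y + a)^2} \geq
\frac{a^2 b^2}{\bigl(x + a + \Phi_b(u)\bigr)^2 (u + b)^2} \\
&\iff \bigl(x + a + \Phi_b(u)\bigr)(u + b) \geq (D + b)(y + a), \\
\bigl(x + a + \Phi_b(u)\bigr)(u + b) &= (x + a)(u + b) + b u, \\
(D + b)(y + a) &= \frac{a^2 u}{a + x} + b\,(x + u + a), \\
\text{difference} &= (x + a)\,u - \frac{a^2 u}{a + x}
= \frac{(y - x)\,\bigl[(x + a)^2 - a^2\bigr]}{x + a} \geq 0 .
\end{align*}
```

The second equivalence uses the fact that all four factors are positive, and the last line is nonnegative because $y \geq x$ and $x \geq 0$.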



Acknowledgment. We are grateful to Franck Jung for performing the 
experiments on rectangle detection and we refer the reader to his cited work 
for a convincing example of the "real thing": the difficult task of detecting